REMARKS 

The Official Action dated November 12, 2002 has been carefully considered. 
Accordingly, the changes presented herewith, taken with the following remarks, are believed 
sufficient to place the present application in condition for allowance. Reconsideration is 
respectfully requested. 

By present amendment, in compliance with the Examiner's request, a paper copy of 
the "Sequence Listing" is entered into the specification. Also submitted on even date, to Box 
Sequence Listing, is a computer readable form (CRF) copy of the sequence listing. 
Applicants submit that the information recorded in the CRF is identical to the written 
sequence listing added by the present amendment. Additionally, it is believed that neither 
the paper sequence listing nor the sequence listing in CRF includes new matter, whereby entry 
is believed to be in order. Also by present amendment, references to particular SEQ ID NOs 
were added in compliance with 37 C.F.R. §1.821 to both the specification and claims 9 and 
10. Claim 1 is amended in order to clarify that the molecule segment contributing to a 
disordered structure which is deleted is terminal, as previously recited in claim 4. In addition, 
the phrase "of the Class I Cytokine family" is added to claims 1 and 2, in order to define the 
invention. Support for this amendment can be found at page 6, lines 3-10, describing the 
scope of the inventive cytokines as comprising the hematopoietin receptor superfamily, 
which is commonly designated by those skilled in the art as the Class I Cytokine family. The 
preambles of claims 2, 5-10, 42 and 43 were amended to delete the word "modified" in order 
to clarify that the hGHR molecule itself is modified. It is believed that these changes do not 
involve any introduction of new matter, whereby entry is believed to be in order and is 
respectfully requested. 

Claims 1, 2, 4-9 and 42-43 were rejected under 35 U.S.C. §112, first paragraph, as 
containing subject matter which was not described in the specification in such a way as to 
enable one skilled in the art to which it pertains to make and/or use the invention 
commensurate in scope with the claims. Specifically, the Examiner asserts that the instant 
specification, while being enabling for a modified human growth hormone receptor (hGHR) 
consisting of residues 32-237 or 32-234 of the native hGHR molecule, capable of being 
crystallized without being complexed to a ligand molecule, does not reasonably provide 
enablement for a cytokine receptor protein modified in the extracellular domain capable of 
being crystallized without being complexed to a ligand molecule. More particularly, the 
Examiner asserts that the instant specification fails to provide any guidance as to how to 
generate crystals of a cytokine receptor which is modified in the extracellular domain by 
deletion of a molecular segment which contributes to a disordered structure. Further, the 
Examiner maintains that there is no evidence or sound scientific reasoning presented in either 
the instant specification or the prior art that would support a conclusion that such 
crystallization of any cytokine receptor is possible or was ever achieved because all the 
teachings of the instant specification are directed to a very specific segment of only one 
example of a cytokine receptor, which is human growth hormone receptor hGHR1-237. The 
Examiner concludes that the instant specification fails to provide any guidance either on how 
to modify any given disclosed or, as yet, undiscovered cytokine receptor protein, or to permit 
an artisan to predict which segments of a receptor protein contribute to a disordered structure, 
thus necessitating an undue amount of experimentation in order to practice the invention. 

This rejection is traversed. Applicants submit that the present amendments clarify the 
scope of the invention, and that the disclosure enables one of ordinary skill in the art to 
make and use the invention commensurate with this scope. Therefore, the rejection 
is overcome and reconsideration is respectfully requested. 

In particular, claim 1 is directed to a cytokine receptor protein of the Class I Cytokine 
family, modified in the extracellular domain, wherein at least one terminal segment which 
contributes to a disordered structure is deleted, the modified protein being capable of being 
crystallized without being complexed to a ligand molecule. Claim 42 is directed to a similar 
protein, specified as human growth hormone receptor. Applicants submit that the determination of 
which segments of a molecule contribute to disordered structure is routine and predictable, 
thus providing precise guidance as to what constitutes the contemplated modifications that 
enable the invention. The Examiner's concern that the internal disordered regions critical to 
binding activity, which cannot be removed without negatively affecting binding activity, may 
not be determinable without undue experimentation, is moot. That is, the internal disordered 
regions that function to allow movement and binding conformation between rigid tertiary 
structures do not comprise the recited deleted segment, as claims 1 and 42 recite that a terminal 
molecule segment which contributes to a disordered structure is deleted. 

It is well within the ability of one of ordinary skill in the art to determine which 
terminal regions of a Class I cytokine receptor contribute to a disordered structure. 
Applicants submit herewith several pertinent publications which establish this ability, 
namely, (1) "Structural Mechanisms for Domain Movements in Proteins" Gerstein, Mark; 
Lesk, Arthur M.; Chothia, Cyrus, Biochemistry, Vol. 33, No. 22, pp. 6739-6749 (1994), (2) 
"Improved Prediction of Protein Secondary Structure by use of Sequence Profiles and Neural 
Networks" Rost, Burkhard; Sander, Chris, Proc. Natl. Acad. Sci. USA, Vol. 90, pp. 7558- 
7562 (1993), (3) "Accuracy of Protein Flexibility Predictions" Vihinen, Mauno; Torkkila, 
Esa; Riikonen, Pentti, PROTEINS: Structure, Function, and Genetics, Vol. 19, pp. 141-149 
(1994), (4) "Hybrid System for Protein Secondary Structure Prediction" Zhang, Xiru; 
Mesirov, Jill P.; Waltz, David, J. Mol. Biol., Vol. 225, pp. 1049-1063 (1992), (5) "Rigid 
Domains in Proteins: An Algorithmic Approach to Their Identification" Nichols, William; 
Rose, George D.; Ten-Eyck, Lynn; Zimm, Bruno H., PROTEINS: Structure, Function, and 
Genetics, Vol. 23, pp. 38-48 (1995), (6) "Detection of Common Three-Dimensional 
Substructures in Proteins" Vriend, Gerrit; Sander, Chris, PROTEINS: Structure, Function, 
and Genetics, Vol. 11, pp. 52-58 (1991), (7) "Yeast Heat Shock Transcription Factor N- 
terminal Activation Domains are Unstructured as Probed by Heteronuclear NMR 
Spectroscopy" Cho, Ho; Liu, Corey W.; Damberger, Fred F.; Pelton, Jeffrey G.; Nelson, 
Hillary; Wemmer, David E., Protein Science, Vol. 5, pp. 262-269 (1996), (8) "Identifying 
Disordered Regions in Proteins from Amino Acid Sequences" Romero, P.; Obradovic, Z.; 
Kissinger, C.R.; Villafranca, J.E.; Dunker, A.K., Proc. I.E.E.E. International Conference on 
Neural Networks, Vol. 1, pp. 90-95 (1997), (9) "Protein Structure Prediction and Design" 
Morea, Veronica; Leplae, Raphael; Tramontano, Anna, Biotechnology Annu Rev., Vol. 4, 
pp. 177-214 (1998), and (10) "Predicting Protein Disorder for N-, C- and Internal Regions" 
e-publication at http://www.jsbi.org/journal/GIW99/GIW99F04.pdf. There are repeated 
references within these articles to databanks of NMR and X-ray crystallographic-derived 
disordered regions within proteins. Drawing on this data, neural network models, which 
exploit the very predictability which the Examiner denies, have become ubiquitous. It is clear 
from inspection of these articles that determining, either via neural network models or 
empirically, whether a terminal region comprises a molecule segment which contributes to a 
disordered state is straightforward and can be done with a reasonable expectation of success. 

Additionally, independent claim 1 is directed to Class I Cytokine Family receptor 
proteins. The striking homology of the extra-cellular domain of these proteins serves as the 
basis for this receptor classification, and, therefore, the basis inherently applies to those 
members of the family not yet discovered. 

In summary, Applicants submit that a person of ordinary skill in the art has a 
reasonable expectation of success in practicing the present invention as defined by claims 1 
and 42 because: 1) terminal segments of extracellular domains are precisely locatable 
regions, 2) deletion is easily accomplished by means well-known in the art, 3) regions 
contributing to disorder are readily identifiable, and 4) Class I Cytokines are defined by a 
structural conservation that confers a high degree of predictability as to the effect of structural 
modifications. Predictably, then, a person of ordinary skill in the art has a reasonable 
expectation that the receptor protein modified according to the present invention will be 
capable of being crystallized without being complexed to a ligand molecule. 

It is therefore submitted that claims 1, 2, 5-9 and 42-43 are enabled by the 
specification in accordance with 35 U.S.C. §112, first paragraph, and that the rejection has 
been overcome. Reconsideration is respectfully requested. 

Claims 1, 2 and 6 were rejected under 35 U.S.C. §112, second paragraph, as being 
indefinite. In particular, the Examiner asserts that claim 1 is indefinite because the recitation 
of "a molecule segment which contributes to a disordered structure" is considered vague and 
ambiguous, "since not every 'molecule segment which contributes to a disordered structure' is 
suitable for deletion to achieve crystallization." 

This rejection is traversed. Applicants submit that this quote was intended to illustrate 
that, typically, internal disordered regions are relevant to binding conformation, and deletion 
may impact subsequent crystallization effort. However, as claims 1 and 42 recite a "terminal 
molecule segment which contributes to disordered structure", these claims are definite in 
accordance with 35 U.S.C. §112, second paragraph, and the rejection has been overcome. 
Reconsideration is therefore respectfully requested. 

Claims 8-10 were rejected under 35 U.S.C. §112, second paragraph, as being 
indefinite. Specifically, the Examiner asserts that the preamble recitation "a modified 
growth hormone receptor" is vague and indefinite because it is not clear and cannot be 
determined from the claim what other possible modifications except truncation of the C- 
terminal end are encompassed by the claim. 

This rejection is traversed. Applicants have amended claims 2, 5-10, 42 and 43 to 
remove "modified" from the preamble, thereby clarifying the claims. Hence, the claims are 
definite in accordance with 35 U.S.C. §112, second paragraph, whereby the rejection is 
overcome and reconsideration is respectfully requested. 

The Examiner requested that Applicants submit a computer readable form (CRF) 
copy of a "Sequence Listing" which includes all of the sequences that are present in the 
instant application, a paper copy of the "Sequence Listing", an amendment directing entry of 
the paper copy into the specification, appropriate statements regarding content and the 
absence of new matter, and amendments to the instant specification and claims to comply 
with 37 C.F.R. § 1.821(d), which requires reference to a particular SEQ ID NO in the 
specification and claims wherever a reference is made to that sequence. Without agreeing 
with the Examiner's basis, i.e., that the modifications of the claimed specific segment of 
hGHR are indefinite, Applicants believe they have fully complied with this request, as 
detailed supra. 

It is believed that the above represents a complete response to the rejections and 
requests set forth in the Official Action, and places the present application in condition for 
allowance. Reconsideration and an early allowance are requested. 



Respectfully submitted, 




Attorney for Applicants 



1900 Chemed Center 
255 East Fifth Street 
Cincinnati, Ohio 45202 
(513) 977-8568 



VERSION WITH MARKINGS SHOWING CHANGES MADE 

In the Specification: 

The specification is amended as follows: 

The paragraph at page 2, line 19 through page 3, line 15 is amended as follows: 

--According to a first aspect, the present invention is directed to a modified 
extracellular domain of a cytokine receptor protein, capable of being crystallized without 
being complexed to a ligand molecule. These modified proteins substantially maintain their 
activity to their native ligands and they will therefore constitute powerful tools for ligand 
interaction studies. The inventive, modified cytokine receptor preferably is of the type which 
oligomerizes when being bound to a ligand. This may include heterooligomerization or 
homodimerization, as discussed in Mol. Cell. Biol., 1994, Vol. 14(6), p. 3535-49: S Watowich 
et al. Most preferably, the modified receptor is a homodimeric cytokine receptor, such as the 
growth hormone receptor (hGHR) having an extracellular part consisting of 237 amino acids 
in its native state. The inventive proteins have at least one molecule segment contributing to 
a disordered structure deleted. Preferably, the deletion results in a truncation in at least one 
terminal end and most preferably it is truncated both in its C-terminal end and in its N- 
terminal end. More preferably, the inventive proteins are modified human growth hormone 
receptors (hGHR) with 31 or 33 amino acid residues removed in its N-terminal end and/or 
with 3 or 4 amino acid residues removed in its C-terminal end. Even more preferably, the 
inventive modified human growth hormone receptor (hGHR) consists of the amino acid 
residues 32-237 (SEQ ID NO: 2), 32-234 (SEQ ID NO: 3), or 34-233 (SEQ ID NO: 4) of the 
native molecule. Of these modified molecules, the truncated receptor consisting of amino 
acids 32-234 (SEQ ID NO: 3) of the native molecule is the most preferred. It should be 
emphasized that said modified cytokine receptors would be readily produced by the skilled 
person with existing methods of recombinant technology and their production in a 
recombinant host and their subsequent purification therefore are not parts of the present 
invention. Further aspects of the invention are disclosed below.-- 

The paragraph at page 12, line 8 through page 14, line 2 is amended as follows: 

--The hGH and hGHR used in the protein crystallographic work were expressed and 
purified as previously described in Sundstrom et al. (1996). Truncation mutants of hGHR 
were created using standard sub-cloning techniques and the expressed protein was assayed 
for hGH binding using affinity and size exclusion gel filtration chromatography as well as 
BIAcore (Pharmacia Biosensor, Sweden) measurements. The hGHR32-234 (SEQ ID NO: 3) 
protein was crystallized by vapor diffusion, using 3 µl protein solution (7 mg/ml in 10 mM 
ammonium acetate) mixed with 3 µl of 0.33 M NH4SO4, 30% (w/v) PEG-2000-dimethyl 
ether, 1% (v/v) DMSO and 100 mM MES buffer at pH 6.4 in a sealed tissue culture 24-well 
plate (Falcon, USA). The crystallization droplets were equilibrated at +18°C with 1 ml of the 
mother liquor for 2-4 weeks to obtain optimal quality crystals that diffracted to at least 2.9 Å 
with a conventional X-ray source. The crystals were frozen directly in the N2 beam by 
adding a 1:1 mixture of 25% (v/v) ethylene glycol and 25% glycerol (v/v) to the 
crystallization droplet. Data was collected at station A1 at Cornell High Energy Synchrotron 
Source using a CCD detector (Area Detector Systems Corp., USA). The data was indexed, 
processed and scaled in the tetragonal space group I4 using the programs DENZO and 
SCALEPACK, developed by S. Bailey in the SERC Daresbury Laboratory, Warrington, 
1993. A molecular replacement search procedure was performed using the program 
AMORE, also developed by Bailey, 1993. The co-ordinates of the site 1 binding hGHbp 
molecule in our 2.5 Å hGH:hGHR 1:2 complex were used. The highest scoring solution in the 
resolution interval 8 - 4 Å was found in space group I4₁, with two hGHbp molecules in the 
asymmetric unit. A rigid-body refinement in X-plor, described by J. Navaza in Acta Cryst., 
1994, Vol. A (50), pp. 157-163, with individual hGHbp domains including data between 10-6, 
10-5 and 10-3.5 Å in each respective cycle, decreased both the R- and Free-R values 
(described by A.T. Brunger in Nature, 1992, Vol. 355, pp. 472-475) dramatically when 
compared to previous runs where the native hGHbp domain arrangement was used. A cyclic 
process of model building in O, described by Brunger, 1992, followed by NCS restrained 
POWELL minimization in X-plor, using data between 15 - 2.3 Å, which was corrected for 
most main and side chain changes to the search molecule. At this stage, the first simulated 
annealing run, as described by T.A. Jones, et al. in Acta Cryst., 1991, Vol. A(47), pp. 110-119, 
was performed using a slow-cooling protocol from 3000 K to 300 K in 50 ps steps. 
Solvent molecules were introduced into Fo-Fc densities above 3.0 σ. After 3 cycles, a total of 
327 solvent molecules had been introduced and assigned to the protein chain using the 
programs DISTANG and WATERTIDY developed by A.T. Brunger et al., 1989, in the CCP4 
program package. A final POWELL minimization was performed, followed by a simulated 
annealing run from 2500 K to 300 K in 50 ps steps and including data between 15 to 2.3 Å. 
Individual B-value refinement was added as the final step, and solvent molecules with high 
temperature factors, greater than 50 Å², or absent 2Fo-Fc electron densities cut-off at 1.0 σ, 
were removed. The Free-R value was used to validate the progress of the entire refinement. 
The final model consisted of residues 32 - 52, 63 - 70 and 80 - 234 of both molecules in the 
asymmetric unit as well as 261 solvent molecules and two sulphate ions. At the present stage 
of refinement, the R-factor of the model is 21.7% (R-free 29.3%), using data between 10 - 2.3 
Å. As a control, a dataset to 3.2 Å at room temperature was collected. No significant 
differences to the 2.3 Å structure were observed, showing that the transfer to cryogenic 
conditions did not induce conformational adaptation. See also, Merritt et al., Acta Cryst., 
D50, 869-73 (1994).-- 

The Table 1 heading at page 15, line 3 is amended as follows: 

--Crystallographic data for hGHR32-234 (SEQ ID NO: 3)--. 

In the Claims: 

Claims 1, 2, 5-10, 42 and 43 are amended as follows: 

1. (Twice amended) A cytokine receptor protein of the Class I Cytokine family, 
modified in the extracellular domain, wherein at least one terminal molecule segment which 
contributes to a disordered structure is deleted, the modified protein being capable of being 
crystallized without being complexed to a ligand molecule. 

2. (Amended) A [modified] protein according to claim 1 being a homo- or 
heterodimeric cytokine receptor of the Class I Cytokine family. 

5. (Amended) A [modified] protein according to claim [4] 1 truncated in its C- 
terminal and in its N-terminal end. 

6. (Twice Amended) A [modified] protein according to claim 5 wherein the 
cytokine receptor protein is human growth hormone receptor (hGHR). 

7. (Twice Amended) A [modified] human growth hormone receptor protein 
(hGHR) according to claim 6 having 31 or [32] 33 terminal amino acid residues removed in 
its N-terminal end. 

8. (Third amendment) A [modified] human growth hormone receptor protein 
(hGHR) according to claim 6 having 3 or 4 terminal amino acid residues removed in its C- 
terminal end. 

9. (Fourth amendment) A [modified] human growth hormone receptor (hGHR) 
consisting of residues 32-237 (SEQ ID NO: 2), 32-234 (SEQ ID NO: 3), or 34-233 (SEQ ID 
NO: 4), of the native hGHR molecule. 

10. (Third amendment) A [modified] human growth hormone receptor (hGHR) 
according to claim 9 consisting of residues 32-237 (SEQ ID NO: 2), of the native hGHR 
molecule. 

42. (Amended) [Modified human] Human growth hormone receptor protein, 
comprising human growth hormone receptor protein truncated in at least one terminal end to 
delete at least one molecule segment which contributes to a disordered structure, the modified 
human growth hormone receptor protein being capable of being crystallized without being 
complexed to a ligand molecule. 

43. (Amended) [A modified human] Human growth hormone receptor protein 
according to claim 42, truncated in its C-terminal end and in its N-terminal end. 

ABSTRACT: We survey all the known instances of domain movements in proteins for which there is 
crystallographic evidence. We explain these domain movements in terms of the repertoire 
of low-energy conformation changes that are known to occur in proteins. We first describe the basic 
elements of this repertoire, hinge and shear motions, and then show how the elements of the repertoire can 
be combined to produce domain movements. We emphasize that the elements used in particular proteins 
are determined mainly by the structure of the interfaces between the domains. 

Nearly all large proteins are built from domains (Wodak 
& Janin, 1981), and large relative movements of domains 
provide spectacular examples of protein flexibility. Domain 
motions are important for a variety of protein functions, 
including catalysis, regulation of activity, transport of me- 
tabolites, formation of protein assemblies, and cellular 
locomotion. Domains often close around a binding site between 
them. Generally, the presence of bound substrates stabilizes 
a closed conformation, and their absence favors an open 
conformation. Consequently, domain motions illustrate 
induced fit in protein recognition (Koshland, 1958). 

Most of our information on the mechanisms of domain 
movements has come from X-ray crystal structures of open 
and closed conformations of particular proteins. The results 
of early investigations were reviewed by Janin and Wodak 
(1983) and by Bennett and Huber (1984). Since then, a 
considerable amount of new information has become available, 
and we review here the portion of this information that concerns 
structural mechanisms of domain closure. 

† Supported by Damon Runyon-Walter Winchell Fellowship DRG-1272 (M.G.) and the Kay Kendall Foundation (A.M.L.). 
‡ MRC Laboratory of Molecular Biology. 
§ Present address: Beckman Center for Structural Biology, Department of Cell Biology, Stanford Medical School, Stanford, CA 94305. 
∥ Department of Haematology, Cambridge University. 
⊥ Cambridge Center for Protein Engineering. 
⊗ Abstract published in Advance ACS Abstracts, May 1, 1994. 

In catalysis, domain closure often excludes water from the 
active site and helps position catalytic groups around the 
substrate. It also traps substrates and prevents the escape of 
reaction intermediates (Anderson et al., 1979; Knowles, 1991). 
Domain closure, therefore, must be fast, and the transition 
between open and closed forms cannot involve high-energy 
barriers. Protein interiors, however, have features that place 
strong constraints on their possible conformational changes: 
they are close-packed with main chains and side chains in 
preferred conformations and with buried polar groups hy- 
drogen bonded. In the first part of this review, we discuss the 
repertoire of possible low-energy conformational changes that 
are available to proteins, i.e., their intrinsic flexibility. In the 
second part we describe how this repertoire of low-energy 
conformational changes is used to produce domain movements 
in particular proteins. 

THE INTRINSIC FLEXIBILITY OF PROTEINS 

The intrinsic flexibility of proteins is taken here to mean 
the ability of different segments of the protein to move in 
relation to one another with only small expenditures of energy. 
Analysis of protein crystal structures has shown that this 
intrinsic flexibility can take two forms: hinge motions in 
strands, β-sheets, and α-helices that are not constrained by 
tertiary packing interactions and shear motions between close- 
packed segments of polypeptide (Figure 1; Table 1). 

(A) Hinge Motions in Strands, Sheets, and Helices Not 
Constrained by Packing Interactions. (1) β-Strands. The 
most basic motion of a polypeptide chain is a few large changes 
in main-chain torsion angles in a localized region, i.e., at a 

© 1994 American Chemical Society 

Table 1, shear mechanism: 
simple example: citrate synthase 
main-chain packing: constrained by close packing 
main-chain torsions: many small changes 
motion overall: concatenation of small local motions 
motion at interface: parallel to plane of interface (shear) 
side-chain packing: same packing in both forms 
side-chain torsions: mostly small changes 



[Figure 1: schematic with panels labeled Shear Motion and Hinge Motion, showing Domain 1, Domain 2, and Ligand in Open and Closed forms, with Interfaces and Hinge marked.] 

[...] strand remain in the allowed regions of the Ramachandran diagram. Consequently, its torsion angle 
changes can be very large and the resulting motion can rotate the polypeptide chain up to 60°. As shown 
in Figure 2A, in lactate dehydrogenase two adjacent torsion angle changes rotate a strand by ~35° in a 
direction not accessible by a single change (Gerstein & Chothia, 1991). 

[...] this additional constraint means [...] to both strands the rotation axes of the principal torsion 
[...] axis of the overall rotation of the sheet [...]. Three large (>30°) torsion angle changes produce 
the bulk [...] 

[...] residues in helices are subject to more severe [...] must be spread over more residues than the 
deformation of sheets. Such spread-out helical deformations can produce [...] the C-terminus of a 
helix in a [...] its end to produce a shift of 3.3 Å (Dixon et al., 1992; Figure 2C). 



Table 1, hinge mechanism entries: 
lactoferrin 
free to kink 
a few large changes 
contacts created; packing at base of hinge crucial 
some large changes 



A different situation occurs in those helices that contain kinks, which often involve prolines. The 
disruption in the normal pattern of hydrogen bonding, and hence in the constraints on the helix, allows 
larger torsion angle changes. As shown in Figure 2D, such large torsion angle changes have been found 
in the proline-kinked helix in adenylate kinase. 

The interconversion of helical and extended conformations is also possible and has been found in 
calmodulin (Ikura et al., 1992; Meador et al., 1992, 1993) and triglyceride lipase (Derewenda et al., 
1992). While such an interconversion may [...] energy barriers somewhat higher than those [...] discussed 
above, it permits large torsion angle changes and large deformations. In calmodulin, torsion angle 
changes in [...] to five residues in the middle of a long helix split it [...] strand. These two small 
helices are inclined at an angle of [...] 

(B) Limited Shear Motions of Close-Packed Segments of Polypeptide. The preceding discussion of 
hinges considered only the effect of structural constraints intrinsic to β-strands, β-sheets, and 
α-helices, i.e., constraints arising from the requirements of secondary structure. [...] tertiary 
structure provide even more severe [...] atoms in a protein are [...] buried and closely packed; in 
particular, most of the main chain is buried beneath layers of side chains. This close [...] hinges. 
Indeed, a structural requirement for a residue to act as a hinge is that it have few tertiary structure 
packing [...] 

As shown in Figure [...], we can divide movements of close-packed segments of polypeptide into those 
that are perpendicular to an interface and those that are parallel. Hinges [...] perpendicular to the 
plane of an interface (so the interface [...]. Although such packing changes [...] interfaces of 
allosteric proteins (Perutz, 1989) they have [...] have the following characteristics: (1) [...] side 
chains accommodate shear motions, mostly, by small 


Perspectives in Biochemistry 



Biochemistry, Vol. 33, No. 22, 1994 6741 




FIGURE 2: Hinge motions in strands, sheets, and helices. (A, far left) A hinge in lactate dehydrogenase is an example of an isolated hinge in 
a strand. Changes in two torsion angles (Δψ(96) = 36° and Δφ(97) = 40°) are responsible for rotating the polypeptide chain ~35°. (B, middle 
left) The hinges in lactoferrin are an example of the coupling of two simple hinges together in a sheet. The hinges move through three large 
torsion angle changes, and the rotation axes for these torsion angle changes are inclined less than 20° with respect to the axis of the overall 
motion. (In the strand on the left Δψ(250) = -33° and Δφ(249) = 30°; in the strand on the right Δψ(90) = 49°.) Small conformational changes 
in adjacent residues help maintain the integrity of the β-sheet structure. As evident in Figure 6, the hinges have few main-chain packing 
constraints on them. (C, middle right) The interdomain helix in lysozyme is an example of a bending helix. It bends through the coordinated 
action of eight torsion angle changes between 9° and 15°, shifting the Cα atom at the C-terminal end of the helix by 3.3 Å. (D, far right) 
The helix linking the two domains in ADK is an example of a kinking helix. A torsion angle change in the residue three before Pro 177 (Δφ 
= -53°) causes the helix to deform in a direction perpendicular to its original kink. 



[Figure 3 panel labels: Small Hinge; Shear Interface] 

FIGURE 3: Shear motions involve interfaces. Two examples taken from citrate synthase show helix-helix interfaces undergoing a shear motion. 
The two labeled axes show the direction of parallel and perpendicular motion at an interface. (Left) The QP helix-helix interface illustrates 
how small hinges in linking peptides function in shear motions. Helix Q shifts 1.4 Å and rotates 13° relative to helix P. (Right) The NQ 
helix-helix interface shows a crossed-helix packing and a slightly larger motion than at the QP interface. Helix N shifts 1.8 Å and rotates 
11° relative to helix Q. There are many close-packed side chains forming the N-Q interface, which just rock slightly in the shear motion. 



changes in side-chain torsion angles. They keep the same 
overall rotamer configuration and move among conformational 
states of nearly the same energy without crossing large energy 
barriers. Occasionally, they may change to a different rotamer 
conformation (i.e., to a different local minimum) with large 
rotations (>100°). (2) The main chain of each segment in a 
shear motion does not deform appreciably. In the case of 
helices, the root mean square difference in the positions of 
their main-chain atoms in the open and closed forms is typically 
0.15-0.25 Å; for loops the difference is slightly larger. This 
rigidity, combined with "rocking" movements of side chains, 
implies that the interface itself shears. (3) The segments shift 
and rotate relative to each other by no more than 2 Å and 15°, 
amounts likely to be the limits of low-energy conformational 
adjustments. Except at very small interfaces, larger movements 
than these require the combination of several shear 
motions. 



These characteristics were initially deduced from the 
analysis of protein crystal structures (Chothia et al., 1983; 
Lesk & Chothia, 1984). A similar, and in some ways more 
detailed, picture of shear motions has recently emerged through 
physical studies and computational simulations (Elber & 
Karplus, 1987; Rojewska & Elber, 1990; Frauenfelder et al., 
1991). 
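
As an illustration of characteristic (2), the root mean square difference between main-chain atom positions can be computed as in the following minimal sketch. It assumes that the open and closed forms have already been superimposed and that atoms are listed in matching order. The function name `rmsd` and the coordinates are hypothetical and are not taken from any structure discussed here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square difference between two lists of (x, y, z) positions.

    Assumes both lists are already superimposed and in the same atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical positions (in angstroms) of three main-chain atoms of one helix.
open_form = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
closed_form = [(0.2, 0.0, 0.0), (1.7, 0.0, 0.0), (3.2, 0.0, 0.0)]

# A value within the 0.15-0.25 angstrom range quoted for helices would be
# consistent with an essentially undeformed main chain.
print(round(rmsd(open_form, closed_form), 2))
```
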

SHEAR AND HINGE MOTIONS UNDERLIE 
DOMAIN-MOTION MECHANISMS 

The characteristics of the two basic mechanisms of protein 
flexibility, hinge and shear motions, are summarized in Figure 
1 and Table 1. These two mechanisms constitute a repertoire 
of conformational changes that can be used in a great variety 
of protein motions. Here we describe their use in the motions 
of whole protein domains, i.e., in the relative motion of discrete 
linked units that consist, in most cases, of at least 100 residues. 
Figure 2: Hinge motions in strands, sheets, and helices. (A, far left) A hinge in lactate dehydrogenase is an example of an isolated hinge in 
a strand. Changes in two torsion angles (36° at residue 96 and 40° at residue 97) are responsible for rotating the polypeptide chain ~35°. (B, middle 
left) The hinges in lactoferrin are an example of the coupling of two simple hinges together in a sheet. The hinges [...] 
torsion angle changes, and the rotation axes for these torsion angle changes are inclined less than 20° with respect to the axis of the overall 
rotation. (In the strand on the left the torsion angle changes are -33° at residue 250 and 30° at residue 249; in the strand on the right the change is 49° at residue 90.) Small conformational changes 
[...] maintain the integrity of the β-sheet structure. As evident in Figure 6, the hinges have few main-chain packing 
constraints on them. (C, middle right) The interdomain helix in lysozyme is an example of a bending helix. It bends through the coordinated 
action of eight torsion angle changes between 9° and 15°, shifting the Cα atom at the C-terminal end of the helix by 3.3 Å. (D, far right) 
The [...] helix in ADK is an example of a kinking helix. A torsion angle change in the residue three before Pro 177 
(-53°) causes the helix to deform in a direction perpendicular to its original kink. 



Hinge and shear mechanisms are also involved in the motion 
of small protein fragments, for example, when individual 
loops or helices move relative to each other. In Table 2 we 
summarize the current crystallographic evidence for hinge 
and shear mechanisms in both domain motions and 
smaller motions. It is important to realize that hinge and 
shear motions are ideal paradigms for describing large domain 
motions. A real domain motion will often have a combination 
of both motions, i.e., hinges in one part of the protein and 
shearing interfaces elsewhere. Nevertheless, many domain 
motions can be described as occurring predominantly by a 
hinge or a shear mechanism. 

As shown in Figure 1, proteins that have a predominantly 
hinged domain motion usually have two domains connected 
by linking hinge regions that are relatively unconstrained by 
packing. A few large torsion angle changes are sufficient to 
produce almost the whole domain motion. The rest of the 
protein rotates essentially as a rigid body, with the axis of the 
overall rotation passing through the linking hinge regions. 
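
The size of such a rigid-body rotation can be read off from the rotation matrix that superimposes the domain in its two conformations. The sketch below assumes that such a matrix is already available from a fitting procedure (not shown). `rotation_about_z` merely builds a test matrix, and both function names are illustrative only.

```python
import math


def rotation_about_z(angle_deg):
    """Build a 3x3 rotation matrix for a rotation about the z axis (test helper)."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle_deg(matrix):
    """Rotation angle of a proper 3x3 rotation matrix, from trace = 1 + 2*cos(theta)."""
    trace = matrix[0][0] + matrix[1][1] + matrix[2][2]
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


# Example: a 53-degree rigid-body rotation, the value quoted for lactoferrin in Table 2.
print(round(rotation_angle_deg(rotation_about_z(53.0)), 3))
```
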

Since an individual shear motion is small, a single one is 
usually not sufficient to produce a large domain motion. 
Usually, a number of shear motions combine to give a large 
effect, in the same way that each block in a stack can slide 
slightly to make the whole stack lean considerably. (The 
peptides that link the shearing segments have small main- 
chain torsion angle changes to accommodate the relative 
movements.) 
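
The stack-of-blocks analogy can be made concrete with a small calculation. This sketch assumes the idealized case in which every local shift points in the same direction and every local rotation is about a parallel axis, so that the contributions simply add. In a real protein the directions differ, and the sum is then only an upper bound. The per-interface limits of 2 Å and 15° are those stated earlier in the text; the function name and the example values are hypothetical.

```python
def cumulative_shear(steps, max_shift=2.0, max_rotation=15.0):
    """Add up a chain of small shear motions.

    steps: list of (shift_in_angstroms, rotation_in_degrees), one per interface.
    Idealized assumption: all shifts are collinear and all rotation axes parallel.
    """
    total_shift = 0.0
    total_rotation = 0.0
    for shift, rotation in steps:
        if abs(shift) > max_shift or abs(rotation) > max_rotation:
            raise ValueError("a single shear motion exceeds the stated limits")
        total_shift += shift
        total_rotation += rotation
    return total_shift, total_rotation


# Four hypothetical interfaces, each shifting 1.5 angstroms and rotating 5 degrees.
total_shift, total_rotation = cumulative_shear([(1.5, 5.0)] * 4)
print(total_shift, total_rotation)
```

No single step exceeds the limits for an individual interface, yet the accumulated motion is several times larger than any one of them.
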

Proteins with shear motions tend to have certain architec- 
tural features. First, they often have layered architectures 
with one layer sliding over another. Second, though shear 
motions have been found at many different interfaces (i.e., 
helix-helix, sheet-helix, loop-sheet, and loop-helix), helix- 
helix interfaces are most commonly used. The helices involved 
in shear motions are usually crossed. That is, they are usually 
oriented in a more perpendicular than parallel fashion 
(interhelical angle 60°-90°). Such crossed geometries are 
unusual in that helix-helix packings tend to be more parallel. 
Crossed helices will obviously have a smaller and more 
accommodating interface than parallel helices, and this is 
perhaps the reason for their preferential involvement in shear 
motions. 
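
The distinction between crossed and more parallel packings can be expressed as a simple test on the angle between two helix axes. In the sketch below the axis vectors are assumed to have been determined beforehand (for example, by fitting a line through the Cα atoms of each helix, which is not shown), and the function names are illustrative. Only the 60°-90° range for crossed helices comes from the text.

```python
import math


def interhelical_angle_deg(axis_a, axis_b):
    """Angle between two helix axes, folded into the range 0-90 degrees."""
    dot = sum(a * b for a, b in zip(axis_a, axis_b))
    norm_a = math.sqrt(sum(a * a for a in axis_a))
    norm_b = math.sqrt(sum(b * b for b in axis_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("helix axis vectors must be non-zero")
    # abs() treats parallel and antiparallel axes alike; min() guards rounding.
    cos_angle = min(1.0, abs(dot) / (norm_a * norm_b))
    return math.degrees(math.acos(cos_angle))


def is_crossed(angle_deg):
    """True if a folded interhelical angle lies in the crossed range (60-90 degrees)."""
    return angle_deg >= 60.0


perpendicular = interhelical_angle_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
antiparallel = interhelical_angle_deg((1.0, 0.0, 0.0), (-2.0, 0.0, 0.0))
print(is_crossed(perpendicular), is_crossed(antiparallel))
```
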

Table 2A lists all instances of crystallographically resolved 
domain motion, i.e., proteins that have been solved in two or 
more conformations. With the notable exception of the 
immunoglobulins, almost all large domain motions can be 
understood in terms of hinge and shear motions. Table 2B 
lists proteins for which a domain closure mechanism can be 
inferred. The structures of these proteins have been deter- 
mined in only one conformation. However, each has a 
structure similar to that of a protein with a well-characterized 
domain motion, i.e., one listed in Table 2A, and is expected 
to move using the same mechanisms. 

EXAMPLES OF SHEAR DOMAIN MOVEMENTS 

(A) Citrate Synthase. Citrate synthase is one of the clearest 
examples of a domain closure occurring through shear motions. 
The molecule is a dimer, and each monomer comprises a large 
domain, containing 15 helices, and a small domain, containing 
five helices, with the active site cleft between them (Figure 
4). The domain closure involves the small domain closing 
over the large one, burying the substrates in the active site 
(Remington et al., 1982). An extensive interface between 
the large and small domains prevents closure taking place 
through a hinge mechanism. As shown in Figure 4, closure 
is produced by the summation of many small shear motions 
Figure 4: Shear motions in citrate synthase. (Top left) Cartoon of 
one subunit of citrate synthase. α-Helices are represented by cylinders. 
The small domain contains helices N, O, P, Q, and R. (Top right) 
Schematic showing the relative movements of the principal helices 
in citrate synthase. [This figure is adapted in part from Lesk & 
Chothia (1984).] Each helix is represented by its letter, and the lines 
indicate the existence of helix-helix packings in both the open and 
closed forms. The shifts and rotations show local changes in the 
positions of pairs of packed helices (i.e., the movement of one helix 
in a pair relative to the other). (Bottom right) The overall effect of 
the helix movements. The same conventions as in the top right 
schematic apply, but the shifts and rotations shown now are those 
required to superimpose equivalent pairs of helices after the open 
and closed forms have been superimposed on the core of the large 
domain. Many small motions add up to shift helix O by 10.1 Å and 
rotate it by 28°. (Bottom left) Incremental motion in shear domain 
closure is shown by Cα traces of the OP loop: black is the apo form; 
white, the holo form; gray, the cumulative effect of motion over the 
K, P, and then Q helix-helix interfaces. (The apo form was fit to the 
holo form, first on the core and then on the K, P, and Q helices.) 

between pairs of packed helices (Lesk & Chothia, 1984). The 
overall motion results in a helix on the far side of the small 
domain shifting by 10 Å and rotating by 28°, thereby moving 
an adjacent loop over the active site. Each local shear motion 
involves one helix moving relative to a neighboring helix by 
main-chain rotations and shifts of up to 13° and 1.8 Å. To 
a good approximation, the main chain of each helix moves 
without deformation as a rigid body. The shear motions are 
facilitated by small deformations in the loops linking the 
helices. 

There are over 50 distinct helix-helix interfaces in the citrate 
synthase dimer. Depending on the angle between neighboring 
helices, these interfaces can be categorized as having roughly 
parallel helices, roughly perpendicular ones, or orientations 
in between. The interfaces between many of the moving helices 
tend to be roughly perpendicular, or "crossed", while the helices 
that are relatively motionless tend to have a more parallel 
packing. 

(B) Aspartate Amino Transferase. In citrate synthase the 
domain closure is the cumulative result of many shear motions. 
In aspartate amino transferase (AAT) the domain motion is 
mainly the result of just two shear motions, which occur in 
Figure 5: XBAabx layering in hexokinase and other proteins. 
XBAabx layering [see Examples of Shear Domain Movements (C)] 
is shown graphically by schematics of GAPDH (left) and hexokinase 
(right). Helices are drawn as narrow cylinders (radius 1.0 Å); sheets 
are represented as sheets as opposed to collections of strands; and 
substrates are drawn in CPK representation. (GAPDH is shown in 
its closed form with its actual ligand. Hexokinase is shown with the 
inhibitor o-toluoylglucosamine.) 

perpendicular directions (McPhalen et al., 1992). AAT has 
an active site situated between a large and a small domain, 
and on substrate binding the small domain closes over the 
active site. The major shear motion involves a 13° rotation 
of the core of the small domain relative to the large one. A 
secondary shear motion moves a helix on one side of the small 
domain in a direction perpendicular to both the interdomain 
interface and the direction of the other shear motion. With 
a 1.2-Å shift and a 10° rotation, it "drops down" to cover the 
active site. 

The shear motions in AAT are facilitated by a hinge motion 
in a long interdomain helix. This helix is kinked by 17° in 
the open form and changes its kink angle by 12° on closure. 

(C) Glyceraldehyde-3-phosphate Dehydrogenase, Alcohol 
Dehydrogenase, and Hexokinase. In the previous two 
examples, domain closure involved motions spread throughout 
a domain. Here we describe three examples where the major 
shear motion occurs at the interface between the domains 
with subsidiary motions on one or both sides of this region. 

Because the enzymes hexokinase, glyceraldehyde-3-phosphate 
dehydrogenase (GAPDH), and alcohol dehydrogenase 
(ADH) share many common architectural features, their 
domain movements proceed through similar mechanisms (see 
Table 2 for detailed references). These enzymes have three 
moving layers in one domain that shift relative to three rigid 
layers in the other domain. This distinctive layering pattern 
is of the form XBAabx, where a, b, and x are the three moving 
layers and X, B, and A are the rigid layers (Figure 5). The 
interface between the two middle layers is where the major 
shear motion occurs. One layer of helices from the mobile 
domain (a) slides over a layer of helices from the motionless 
domain (A). Helices in these two layers, which in a sense 
form gears upon which the domains slide, are often crossed, 
as is dramatically illustrated for the case of hexokinase in 
Figure 5. Near the a and A layer helices, the ligand binds 
in the interdomain cleft. Packed onto either side of the central 
layers of helices (a and A) are sheets (b and B) from the 
mobile and motionless domains, respectively. The mobile sheet 
(b) forms a second moving layer, which slides over the helices, 
and packed onto the other face of this sheet is a third layer 
(x) which moves with the sheet. Symmetrical to this third 
moving layer (x), a third motionless layer (X) is packed onto 
one side of the static sheet (B). (Layer x is made up of helices 
in hexokinase and GAPDH and of helices and a sheet in ADH, 
and layer X is made up of helices in hexokinase and ADH and 
of a sheet from another subunit in GAPDH.) 

In addition to its shear motion, ADH also has two well- 
defined hinge points (Eklund et al., 1981; Colonna-Cesari et 
al., 1986). 

As discussed in Table 2B, a number of other proteins have 
XBAabx architectures similar to those of hexokinase, 
GAPDH, and ADH but have not yet been solved in multiple 
conformations. These proteins include phosphoglycerate 
kinase (PGK), actin, and heat-shock protein. There is 
experimental evidence that these proteins may undergo domain 
movements [e.g., for PGK, see Mas et al. (1987, 1988)], and 
they would be expected to use mechanisms similar to those 
of hexokinase, GAPDH, and ADH. Moreover, a model-building 
study done on PGK (Blake et al., 1986) predicts that 
the domain movement will involve the shearing of the two 
central helices, a conclusion similar to that implied by our 
comparisons. 

(D) trp Repressor. In the previous sections we described 
examples of domains closing around substrates. In the trp 
repressor, the binding of a ligand stabilizes a more open 
conformation. The trp repressor is a small protein that 
regulates three operons involved in the synthesis of tryptophan. 
It is a dimer, and each subunit contains six helices, divided 
between two domains. The central core of the molecule is 
formed from four helices from each subunit. On either side 
of this core, helix-turn-helix motifs form two symmetrically 
arranged DNA "reading head" domains. Between the central 
core and the reading-head domains, there are two binding 
sites for L-tryptophan, which need to be filled for trp repressor 
to recognize DNA (Zhang et al., 1987). Comparison of the 
holo and apo forms of the repressor (Lawson et al., 1988) 
shows that the binding of L-tryptophan shifts Cα atoms in the 
reading-head domain by up to 4 Å. These shifts are produced 
by separate shear motions of the two helices in the reading-head 
domain (0.75-1.5 Å, 5-20°). These helix motions move 
the reading-head domains further apart than they are in the 
apo form so they are correctly separated to bind DNA. 

EXAMPLES OF HINGED DOMAIN MOVEMENTS 

(A) Tomato Bushy Stunt Virus. An example of a very 
simple hinge motion is found in the coat protein of tomato 
bushy stunt virus (Olsen et al., 1983). This spherical virus 
contains 180 subunits arranged with icosahedral symmetry 
on a T = 3 lattice. Each subunit, in turn, contains two major 
domains, the shell (S) and projection (P) domains, that are 
linked by a peptide in an extended conformation. The 
symmetry of the virus requires each subunit to fit into one of 
three different packing environments. One of the principal 
mechanisms for accommodating the different environments 
is a relative movement of the two domains by ~22°. This 
movement involves a simple hinge in the peptide connecting 
the S and P domains (Olsen et al., 1983). 

(B) Calmodulin. Like the TBSV coat protein, the motion 
in calmodulin involves a single deformation. The unligated 
form of calmodulin contains two globular domains, connected 
by a long helix (Babu et al., 1985). NMR and X-ray structures 
of ligated calmodulin show the molecule binding to peptide 
helices with different sequences and the two domains closing 
around the peptide far enough to make contact with each 
other (Ikura et al., 1992; Meador et al., 1992, 1993). As 
discussed above [The Intrinsic Flexibility of Proteins (A) (3)], 
in this motion, the long interdomain helix, which is known to 
have only marginal stability in solution (Ikura et al., 1992), 
Table 2: Proteins That Undergo Domain Movements 

Protein | PDB identifiers | References | Description of motion 

(A) Proteins for which open and closed conformations are known [a,b] 

(i) Domain motion is predominantly shear 

Citrate Synthase [c] | 1CTS, 3CTS | Remington et al., 1982; Lesk & Chothia, 1984 | Shear motions at many helix-helix interfaces shift main-chain atoms up to 10 Å. 
Aspartate Amino Transferase (AAT) [c] | 9AAT, 1AMA | McPhalen et al., 1992 | Shear motion at 2 interfaces combined with hinge in a kinked helix. 
Trp Repressor [c] | 1WRP, 2WRP, 3WRP | Lawson et al., 1988 | Shear motion between 2 helices adjusts position of helix-turn-helix reading-head domain to enable it to bind DNA. 
Hexokinase | 2YHX, 1HKG | Bennett & Steitz, 1978, 1980; Lesk & Chothia, 1984 | Shear motion with XBAabx layering. Prominent crossed helices at interdomain interface. 
Glyceraldehyde-3-phosphate Dehydrogenase (GAPDH) [c] | 1GD1, 2GD1 | Skarzynski & Wonacott, 1988 | Shear motion with XBAabx layering. 
Alcohol Dehydrogenase (ADH) [c] | 8ADH, 6ADH | Eklund et al., 1981; Colonna-Cesari et al., 1986 | Shear motion with XBAabx layering and 2 hinges. 
Endothiapepsin | 4APE, 5ER2 | Sali et al., 1989, 1992 | Small shearing motion at 1 interface between domains (17° rotation and 1 Å displacement). 

(ii) Domain motion is predominantly hinge 

Tomato Bushy Stunt Virus (TBSV) Coat Protein [c,e] | 2TBV | Olson et al., 1983 | 1 interdomain linkage, 1 hinge, ~22° rotation. 
Lactoferrin [c] | 1LFH, 1LFG | Anderson et al., 1990; Gerstein et al., 1993b | 2 interdomain linkages, 2 hinges (in a β-sheet), 53° rotation. See-saw between two interfaces. 
Maltodextrin Binding Protein (MBP) [c] | 1OMP, 2MBP | Sharff et al., 1992 | 3 interdomain linkages, 3 hinges, 35° rotation. 
Lysine/Arginine/Ornithine (LAO) binding protein [c] | 1LST | Oh et al., 1993 | 2 interdomain linkages, 2 hinges, 52° rotation. 
T4 lysozyme mutants: Ile3→Pro & Met6→Ile [c] | 1L96, 1L97 | Dixon et al., 1992; Faber & Matthews, 1991 | 2 hinges, at either end of interdomain helix, produce rotations up to 32°. 
Adenylate Kinase (ADK) [c,g] | 1AK3, 1AKE | Schulz et al., 1990; Gerstein et al., 1993a | 2 interdomain linkages and 4 hinges (one involves kinking helix). 60° rotation from 1st pair of hinges, 30° from 2nd pair, 90° total. 
Catabolite Gene Activator Protein (CAP) [e] | 3GAP | Weber & Steitz, 1987 | 1 interdomain linkage and 1 hinge. Comparison of subunits in the dimer reveals that the small domain has rotated ~30° closer to the large domain in one subunit. 
cAMP-dependent Protein Kinase (catalytic domain) [c,d] | 1ATP, 1APM | Karlsson et al., 1993 | 1st set of hinges, involving 3 interdomain linkages, produces 12° rotation of domain cores (with ~3 Å shift). 2nd set of hinges produces further 6° rotation of a loop. 1 shearing interface between domains. 
Calmodulin [c] | 1CLL, 4CLL, 2BBM | Ikura et al., 1992; Meador et al., 1992, 1993 | 1 interdomain linkage, 1 hinge, ~150° rotation. Hinge involves long helix splitting into 2 helices (inclined at ~100°) with strand in between. 
Glutamate Dehydrogenase | none listed | Stillman et al., 1993 | 13° rotation of 1 domain relative to other. 


(iii) Domain motion is not predominantly a hinge or shear mechanism 

Immunoglobulins [c,h] | 2FB4, 1MCP | Bennett & Huber, 1984; Lesk & Chothia, 1988 | Hinge motion in linking peptides. Ball & socket joint forms interface between domains. Range of rotations up to 50° allowed. 
Serpins | 5API, 1OVA | Loebermann et al., 1984; Engh et al., 1990; Stein & Chothia, 1991; Mottonen et al., 1992 | Translation at a helix-sheet interface results in the transformation of the tertiary structure by insertion of strand into [...] 
HIV-1 Reverse Transcriptase | 1HMI, 1HVT | Kohlstaedt et al., 1992; Jacobo-Molina et al., 1993 | Comparison of subunits shows very large rearrangement of 2 of the 4 domains, which is accommodated by changes in loops and by unfolding of small 3-stranded β-sheet. 
TATA-box Binding Protein (TBP) [e] | 1TBP | Kim et al., 1993a, 1993b; Chasman et al., 1993 | Twisting of a central sheet moves 2 domains ~10°. 
Thermolysin, Elastase, neutral proteases | 1EZM, 4TMN | Holland et al., 1992; Thayer et al., 1991 | Bending interdomain helix. 
Elongation Factor Tu (EF-Tu) [d] | 1ETU | Berchtold et al., 1993; Kjeldgaard et al., 1993 | Internal loop movements similar to those in ras protein (below) lead to large domain rearrangements (90° rotation, 40 Å shifts). 
Table 2: (Continued) 

Protein | PDB identifiers | References | Description of motion 

(B) Proteins for which only one conformation is known 

(i) Domain motion is predominantly shear 

Phosphoglycerate Kinase (PGK) | 3PGK | Harlos et al., 1992 | Similar to hexokinase (XBAabx layering). 
Heat Shock Protein | [illegible] | Flaherty et al., 1990 | Similar to hexokinase (XBAabx layering). 
Actin | 1ATN | Kabsch et al., 1990; Flaherty et al., 1991 [i] | Similar to hexokinase (XBAabx layering). 
Aspartic Proteases, besides endothiapepsin: Penicillopepsin, Rhizopuspepsin, Chymosin, Porcine Pepsin | 2APP, 2APR, 2PEP, 3CMS, 1PSG | Sali et al., 1992 | Similar to endothiapepsin. 

(ii) Domain motion is predominantly hinge 

Sulfate & Phosphate Binding Proteins | 1SBP, 1ABH | Luecke & Quiocho, 1990; Pflugrath & Quiocho, 1988 | Similar to MBP & lactoferrin. These are group-II periplasmic binding proteins. 
Arabinose, Leucine, & Galactose Binding Proteins | 2LBP, 2GBP, 1ABP | Gilliland & Quiocho, 1981; Vyas et al., 1988, 1991; Sack et al., 1989a,b | Similar to MBP & lactoferrin. However, these are group-I periplasmic binding proteins and are not as similar as group-II ones (above) are. 
Transferrins (N-terminal lobe) | 1TFD | Sarra et al., 1990 | Similar to lactoferrin. 
Guanylate Kinase (GDK) | 1GKY | Stehle & Schulz, 1990 | Similar to ADK. 
Porphobilinogen Deaminase | 1PDA | Louie et al., 1992 | Domains 1 and 2 similar to lactoferrin. 

(iii) Domain motion cannot be classified at present [f] 

Myosin | none listed | Rayment et al., 1993 | Closure of a nucleotide-binding cleft, with similarities to that of ADK, hypothesized to produce movements > 50 Å. 
Transducin-α | none listed | Noel et al., 1993 | Similar movements to EF-Tu and ras expected. 

(C) Proteins known in two conformations which involve movements of fragments smaller than domains [a] 

(i) Motion is predominantly shear 

Insulin [e] | 4INS | Chothia et al., 1983 | Helices shear by ~1.5 Å. 
Thymidylate Synthase | 3TMS, 2TSC | Perry et al., 1990; Montfort et al., 1990 | Small shear motion of helices packed onto central sheet. 
Dihydrofolate Reductase (DHFR) | 4DFR, 5DFR | Bystroff et al., 1991 | Small (~3 Å) movement, shearing interface with hinges. 

(ii) Motion is predominantly hinge 

Annexin V | 1AVR, 1RAN | Sopkova et al., 1993; Concha et al., 1993 | Large movements of 2 loops and end of a helix move a buried Trp residue 18 Å to surface. 
Lactate Dehydrogenase (LDH) | 6LDH, 1LDM | White et al., 1976; Gerstein & Chothia, 1991 | Loop closure with 2 hinges, one in helix, moves Cα atoms ~11 Å. 
Triose Phosphate Isomerase (TIM) | 2YPI, 3TIM, 6TIM | Lolis & Petsko, 1990; Joseph et al., 1990; Wierenga et al., 1991 | Loop closure with 2 hinges moves Cα atoms ~7 Å. 
Enolase | 3ENL, 7ENL | Lebioda & Stec, 1991 | Loop movements of ~7 Å. 
HIV-1 protease | 4HVP, 3HVP, 5HVP | Miller et al., 1989; Fitzgerald et al., 1990 | Two large loop regions, which together comprise one quarter of the structure, move Cα atoms ~7 Å. 
Foot and mouth disease virus [d] | 1BBT | Parry et al., 1990 | Comparing variants of virus shows movement of a surface loop. 
Triglyceride Lipase | 1TGL, 4TGL | Derewenda et al., 1992 | 2 hinges on either side of a helix move Cα atoms up to 12 Å. In one hinge a residue changes from an extended to a helical conformation. 
Isocitrate Dehydrogenase [d] | 3ICD | Stoddard & Koshland, 1993 | Loop movements of ~2 Å. 
Malate Dehydrogenase (MDH) [e] | 4MDH | Birktoft et al., 1989 | Comparison of subunits shows a loop closure similar to LDH, moving Cα atoms up to 8 Å. 
ras Protein | 4Q21, 6Q21 | Milburn et al., 1990; Schlichting et al., 1990 | 2 loop movements move Cα atoms up to 10 Å (one movement includes helix attached to loop). 


[a] [...] known, we refer to the papers that describe structure comparisons. Further references to the individual open and closed structures can be found in these papers. 
[b] Allosteric proteins are not included because these proteins have motions that involve extensive repacking of interfaces (see Perutz, 1989, for a review). Such repacking involves high-energy conformational transitions distinctly different from the hinge and shear mechanisms. 
[c] Indicates proteins discussed in detail in the text. 
[d] Structures of 2 conformations have been solved but only 1 has been deposited in the Protein Data Bank. 
[e] Motion is evident in comparing different subunits in the asymmetric unit. Single data bank identifier applies for both forms. 
[f] It is not possible to classify some domain motions at present because full sets of coordinates or detailed analyses are not yet available. 
[g] ADK also has a shear motion when the first substrate, AMP, binds: in moving from the conformation of 3ADK to 1AK3, 3 helices with a crossed geometry shift 1-2 Å to rearrange the geometry of the nucleotide binding site slightly (Diederichs & Schulz, 1991). 
[h] Data bank identifiers for only two of the many representative immunoglobulin structures are indicated. 
[i] This paper describes the structural similarity of actin and the [...] 
[Figure caption, partly illegible in source] [...] the two antiparallel β-strands [...] and 251) are indicated by empty circles, [...] the open and closed forms, respectively. [...] the closed form. 



partly unfolds to break into two helical segments connected 
by a hinge region in an extended conformation. The angle 
between the axes of the two helical segments is ~100°. As 
there is an additional twist around the helix axes, the total 
rotation of one domain relative to the other is more than 150°. 
[...] of flexibility in the side chains that make contact with the 
[...] and by slightly shifting the relative placement of the 
domains through changes in the extent of the hinge region, 
[...] been dubbed "a variable expansion [...] 

(C) T4 Lysozyme. Like calmodulin, two mutants 
of T4 lysozyme (Ile 3 → Pro and Met 6 → Ile) have a hinge 
motion involving a long interdomain helix. Crystals of these 
[...] the crystal form, their structures either are very similar to 
[...] small hinge points for the 
C-terminal part of the helix (Figure 2C). As the location of 
the [...] next to the hinge, the domain motion appears 
to be a consequence of the loss of close packing created by the 
mutation and is an example of hinged motion created by 
reducing the number of steric constraints. 

(D) Lactoferrin and the Periplasmic Binding Proteins. 
Unlike the TBSV coat protein, lysozyme, and calmodulin, 
[...] three interdomain linkages. [...] 
are examples of transport proteins that use domain closure to 
recognize and sequester small molecules. 

Lactoferrin has two similar lobes, and each lobe, in turn, 
has two domains with an iron-binding site between them. 
Analyses of the open and closed forms [...] 
detailed picture of the domain movements (Anderson et al., 
1990; Gerstein et al., 1993b). Upon binding iron, the two 
domains move together, rotating 53° essentially as rigid bodies. [...] 
The axis of rotation passes through the two strands linking 
[...] discussed above (Figure 2B), [...] 
of the principal torsion angle changes are nearly parallel to 
the axis of the overall 53° rotation, the local motion in 
the [...] can be directly related to the overall domain [...] 

The two domains make different packing contacts in the 
[...] forms. In the open form the contacts are on 
[...] between the two interfaces: when the domains close, residues 
[...] one side of the hinges become buried and 
[...] residues on the other side become exposed. 
The situation is reversed on opening. 

Lactoferrin shares a similar structure, topology, and binding 
site [...] 
(Baker et al., 1987). For two of these binding proteins 
[...] 
et al., 1992; Spurlino et al., 1991; Oh et al., 1993), structures 
[...] mechanism of domain movement appears to be similar to 
that in lactoferrin. The domain motion in the maltodextrin- 
[...] rotation about an axis through the 
[...] located [...] torsion angle 
[...] in the three peptides linking the domains. The 
[...] two of the hinges are structurally equivalent to 
[...] of the lactoferrin hinges. In the LAO-binding protein 
the [...] rotation of the 
[...] a few large torsion angle changes in a region structurally 
equivalent to the lactoferrin hinge. 

(E) Adenylate Kinase. A more complex and extensive hinge 
[...] 
enzyme has two nucleotide binding sites, and crystal structures 
have been solved with both sites, a single site, 
[...] filled (Schulz et al., 1974, 1990; Diederichs & Schulz, 1991; 
Müller & Schulz, 1992). The major conformational change, 

Figure 7: Ball-and-socket motion in the immunoglobulins. (Left) The conserved VH-CH1 contacts and the switch (hinge) peptides. Three 
VH residues (11, 110, and 112) form a "socket", and two CH1 residues, 149 and 150, form a "ball". The view is such that the motion of the 
V dimer relative to the C dimer is perpendicular to the page. (Middle and Right) The movement of the ball-and-socket joint. The five side 
chains in the joint are represented by spheres drawn at one-half van der Waals radius. White spheres indicate the socket, and black ones the 
ball. The orientation is roughly perpendicular to that in the left figure (see the eye symbol there). The middle figure shows the packing that 
occurs when the domains are fully extended (i.e., 180° elbow angle), and the right figure shows the packing that occurs when the domains 
are close enough to be in contact (i.e., 135° elbow angle). 




which occurs when the second substrate binds, rotates the
smaller of the two domains ~90° relative to the larger one
and shifts main-chain atoms up to 32 Å. The small and large
domains are linked by two helices, and on closure, confor-
mational changes take place in four hinges at the N and C
termini of these linking helices (Gerstein et al., 1993b). Two
of these hinges have simple motions; a third hinge requires
motion throughout an extended loop; and a fourth hinge
(Figure 2D) occurs in the middle of a proline-kinked helix.
The four hinges have few packing constraints on their main
chain. One pair of hinges is responsible for one-third of the
total rotation, and the other pair, for two-thirds.
(F) cAMP-Dependent Protein Kinase. Like ADK, the
catalytic subunit of cAMP-dependent protein kinase has an
elaborate multipart hinged motion, which involves at least
five distinct hinges, split into two sets. Containing two
domains, one large and one small, the structure of the catalytic
subunit has been solved in binary and ternary complexes with
an inhibitory peptide and in an apo form (Knighton et al.,
1991; Karlsson et al., 1993). In a comparison of the apo form
with either complex form, the core of the small domain rotates
~12° relative to that of the large one. The small domain is
principally connected to the large domain through three
roughly parallel peptide linkages, which deform as hinges upon
closure. In addition, through the deformation of two more
hinges a loop in the small domain near the binding pocket
rotates a further 6° down into the interdomain cleft. Partly
because of the size of the interdomain cleft, which has to
accommodate a 15-residue peptide, the protein kinase motion
does not involve an extensive interdomain interface. There
is, however, one helix in the small domain which moves in a
shear fashion to maintain its contacts with the large domain
throughout the motion.

THE BALL-AND-SOCKET MOTION IN THE
IMMUNOGLOBULINS

The domain motion observed in the immunoglobulins
involves, so far as is known at present, a unique combination
of hinge and shear motions. In the immunoglobulins the VL
domain is linked by an extended peptide to the CL domain,
and VH is similarly linked to CH1. VL and VH pack together,
as do CL and CH1. The VL-VH dimer can freely rotate, relative
to the CL-CH1 dimer, over a range of ~50° in a manner
described as "elbow motion".

Elbow motion involves localized deformations in the two 
peptides that link the V and C dimers (Bennett & Huber, 
1984). These deformations are similar to those in the hinged 
domain closures described in the previous sections. However, 
the elbow motion also involves an unusual type of shear 
motion: two large residues in CH1, a Pro and a Phe, pack
closely together, forming a "ball", and three residues in VH
spread out as part of a β-sheet, forming a "socket" (Figure
7). The three VH and two CH1 residues are packed together
and move relative to each other in a manner similar to a socket 
moving over a ball (Lesk & Chothia, 1988). 

Unlike the shear motions discussed above, which are 
characterized by close-packed interfaces of interdigitating 
side chains, the ball-and-socket joint has a "smooth" interface,
in which the side chains do not interdigitate. This interface 
facilitates motion over a wide range of relative orientations. 
It also permits greater flexibility than is found in shear 
motions: the socket residues can move up to 4.5 Å relative
to those in the ball, rather than the 1.5-2.0-Å displacement
usually found at an interface undergoing shear motion. 

THE STABILITY OF THE CLOSED AND OPEN 
STATES 

The evidence currently available suggests that the open 
and closed states are only slightly different in energy and at 
room temperature are in dynamic equilibrium. This small 
energy difference between the open and closed states is most 
directly suggested by the discovery that relatively weak crystal 
packing forces can stabilize the unliganded closed forms of 
lactoferrin and the binding proteins (Baker et al., 1991; Sharff
et al., 1992, and references therein). It is also suggested by
simulations of loop closure (Wade et al., 1993, 1994).
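
As an illustration of why a small energy difference implies a dynamic equilibrium, the following sketch computes two-state Boltzmann populations. It is not taken from the article: the two-state model, the function name `closed_fraction`, and the example energy gaps are assumptions chosen only to show the arithmetic.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def closed_fraction(delta_g_kcal: float, temperature_k: float = 298.0) -> float:
    """Fraction of molecules in the closed state of a two-state equilibrium.

    delta_g_kcal is G(closed) - G(open); a positive value means the closed
    state is the less stable one. The two-state model is an assumption.
    """
    ratio = math.exp(-delta_g_kcal / (R_KCAL * temperature_k))  # [closed]/[open]
    return ratio / (1.0 + ratio)


# Hypothetical energy gaps (not values reported in the article).
for dg in (0.0, 0.5, 1.0, 2.0):
    print(f"dG = {dg:.1f} kcal/mol -> closed fraction {closed_fraction(dg):.3f}")
```

Under these assumptions, a gap comparable to thermal energy leaves both states appreciably populated, which is consistent with weak crystal packing forces being enough to select one of them.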

The relative stabilities of the open and closed states depend 
on the presence or absence of the substrate. A likely
progression is that the substrate first binds to one domain,
then thermal fluctuations bring the second domain into contact
with it, and the newly formed contacts stabilize the closed
conformation. The ability of a ligand to bind to a single domain
has, in fact, been observed in transferrin (Lindley et al., 1993).
Inspection of the structures of liganded closed states invariably
shows that the ligand makes numerous interlocking salt 
bridges, hydrogen bonds, and packing interactions with both 
domains (references in Table 2), and these interactions account 
for the stability and specificity of the closed state. Catalytic 
transformation of the substrate destroys, at least in part, the 
interactions made with the protein and so makes the open 
form more stable. The rate of domain movements, conse- 
quently, is governed to a degree by the catalytic efficiency of 
the protein. This may be particularly relevant to domain 
movements involved in locomotion. 

The main function of the open form is to allow access to 
the active site. By itself, this function does not require the 
open form to have a unique conformation, as opposed to a 
range of conformations. Experimental evidence for a unique 
open form is sketchy and mixed. On the one hand, there is 
clear evidence that the open form has a unique conformation 
in certain proteins. As discussed above [Examples of Hinged
Domain Movements (C)], in lactoferrin the interdomain
interface formed in the open form appears to uniquely fix its
conformation. Likewise, within particular species, AAT has
the same open conformation in different crystal forms, which
have very different intermolecular contacts (McPhalen et al.,
1992). On the other hand, there is also evidence that the open
form of other proteins can have a range of conformations. T4
lysozyme has been found to have a number of different open
conformations in various crystal forms (Faber & Matthews,
1991; Dixon et al., 1992). The leucine/isoleucine/valine-
binding protein has been solved in a "more-opened" form
(Sharff et al., 1992, and references therein). A variety of
different orientations have been found for the two domains
of Escherichia coli NADP+-dependent glutamate dehydro-
genase; this hexameric protein has been solved in a crystal
form where all six subunits are in the asymmetric unit (D. 
Rice, personal communication). 

Note that the crystallographic evidence relating to the 
uniqueness of end states must be treated with care as there 
is a possibility that the intermolecular contacts in the crystal 
may fix domains in orientations not preferred in solution. Also, 
crystallography tends to make one think in terms of discrete 
rigid conformational states, which may be an erroneous model 
for open and closed conformations. 

CONCLUSIONS 

We have shown how hinge and shear motions, which
constitute the repertoire of low-energy conformation changes 
available to proteins, can be combined to describe most of the 
known instances of domain movements. We emphasize the 
importance of the architecture of the interdomain interface 
in determining the relative mix of hinge and shear motions. 
While our hinge and shear mechanisms do not describe domain
motions precisely enough for accurate energy calculations, 
they provide a conceptual framework for understanding 
complicated structural transformations and can be used as a 
guide for more quantitative formulations. As more data 
become available, the descriptions of hinge and shear 
mechanisms should be refined and extended so that they can 
be applied to the complex large-scale motions that occur in 
structures such as myosin (Rayment et al., 1993).

An expanded and routinely updated version of Table 2 (a
listing of protein structures that undergo conformational
change) will be available electronically in plain text and
hypertext forms. Use (i) anonymous ftp or WWW with URL
"file://cb-iris.stanford.edu/pub/ProteinMovements/ProteinMovements.html",
(ii) anonymous ftp to "al.mrc-lmb.cam.ac.uk"
for filename "pub/ProteinMovements/ProteinMovements.html",
or (iii) electronic mail to mbg@cb-iris.stanford.edu.

ABSTRACT The explosive accumulation of protein se-
quences in the wake of large-scale sequencing projects is in
stark contrast to the much slower experimental determination 
of protein structures. Improved methods of structure predic- 
tion from the gene sequence alone are therefore needed. Here, 
we report a substantial increase in both the accuracy and 
quality of secondary-structure predictions, using a neural- 
network algorithm. The main improvements come from the use 
of multiple sequence alignments (better overall accuracy), from
"balanced training" (better prediction of β-strands), and from
"structure context training" (better prediction of helix and 
strand lengths). This method, cross-validated on seven differ- 
ent test sets purged of sequence similarity to learning sets, 
achieves a three-state prediction accuracy of 69.7%, signifi- 
cantly better than previous methods. In addition, the predicted 
structures have a more realistic distribution of helix and strand 
segments. The predictions may be suitable for use in practice 
as a first estimate of the structural type of newly sequenced 
proteins. 

The problem of protein secondary-structure prediction by 
classical methods is usually set up in terms of the three 
structural states, α-helix, β-strand, and loop, assigned to
each amino acid residue. Statistical and neural-network 
methods use a reduction of the data base of three-dimensional 
protein structures to a string of secondary-structure assign- 
ments. From this data base the rules of prediction are derived 
and then applied to a test set. For about the last 10 yr, 
three-state accuracy of good methods has hovered near
62-63%. Recently, values of 65-66% have been reported
(1-4). However, when test sets contain proteins homologous
to the learning set or when test results have not been multiply 
cross-validated, actual performance may be lower. 

Point of Reference 

We use as a "reference network" a straightforward neural-
network architecture (5) trained and tested on a data base of
130 representative protein chains (6) of known structure, in 
which no two sequences have >25% identical residues. The 
three-state accuracy of this network, defined as the percent- 
age of correctly predicted residues, is 61.7%. This value is 
lower than results obtained with similar networks (5, 7-10) 
for the following reasons. (i) Exclusion of homologous pro-
teins is more stringent in our data base; i.e., test proteins
may not have >30% identical residues to any protein in the
training set. Other groups allow cross-homologies up to 49%
[e.g., 2-hydroxyethylthiopapain (1ppd) and actinidin (2act) in
the testing set termed "without homology" in ref. 5] or 46%
(4). (ii) Accuracy was averaged over independent trials with
seven distinct partitions of the 130 chains into learning and
test set (7-fold cross-validation). The use of multiple cross-
validation is an important technical detail in assessing per-
formance, as accuracy can vary considerably, depending 
upon which set of proteins is chosen as the test set. For 
example, Salzberg and Cost (3) point out that the accuracy of 
71.0% for the initial choice of test set drops to 65.1% 
"sustained" performance when multiple cross-validation is 
applied — i.e., when the results are averaged over several 
different test sets. We suggest the term sustained perfor- 
mance for results that have been multiply cross-validated. 
The importance of multiple cross-validation is underscored 
by the difference in accuracy of up to six percentage points 
between two test sets for the reference network (58.3- 
63.8%). 
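
The multiple cross-validation described above can be sketched as follows. This is an illustration only: the helper names, the random partitioning, and the `train_and_score` callback are hypothetical and stand in for whatever training procedure is actually used; the text specifies only that accuracy is averaged over seven distinct partitions of the 130 chains.

```python
import random
from statistics import mean


def partition_chains(chain_ids, n_folds=7, seed=0):
    """Split chain identifiers into n_folds disjoint test sets (assumed random)."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def sustained_accuracy(chain_ids, train_and_score, n_folds=7, seed=0):
    """Average a per-fold accuracy over all folds.

    train_and_score(train_ids, test_ids) is a hypothetical callback that
    trains on train_ids and returns the accuracy measured on test_ids.
    Returns (mean, lowest fold, highest fold).
    """
    folds = partition_chains(chain_ids, n_folds, seed)
    scores = []
    for i, test_ids in enumerate(folds):
        train_ids = [c for j, fold in enumerate(folds) if j != i for c in fold]
        scores.append(train_and_score(train_ids, test_ids))
    return mean(scores), min(scores), max(scores)


# Usage with a dummy scorer; a real scorer would train and test a network.
average, lowest, highest = sustained_accuracy(range(130), lambda tr, te: 60.0)
print(average, lowest, highest)
```

Reporting the lowest and highest fold alongside the mean makes the spread between test sets visible, which is the point the paragraph above makes about single-partition results.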

Use of Multiple Sequence Alignments 



It is well known that homologous proteins have the same 
three-dimensional fold and approximately equal secondary 
structures down to a level of 25-30% identical residues (11). 
With appropriate cutoffs applied in a multiple sequence 
alignment (12), all structurally similar proteins can be 
grouped into a family, and the approximate structure of the 
family can be predicted, exploiting the fact that the multiple
sequence alignment contains more information about the 
structure than a single sequence. The additional information 
comes from the fact that the pattern of residue substitutions 
reflects the family's protein fold. For example, substitution of 
a hydrophobic residue in the protein interior by a charged 
residue would tend to destabilize the structure. This effect 
has been exploited in model building by homology (e.g., in
ref. 13) and in previous attempts to improve secondary-
structure prediction (14-18). Our idea was to use multiple
sequence alignments rather than single sequences as input to
a neural network (Fig. 1). At the training stage, a data base
of protein families aligned to proteins of known structure is
used (Fig. 2). At the prediction stage, the data base of
sequences is scanned for all homologues of the protein to be 
predicted, and the family profile of amino acid frequencies at 
each alignment position is fed into the network. The result is 
striking. On average, the sustained prediction accuracy in- 
creases by 6 percentage points. If single sequences rather 
than profiles are fed into a network trained on profiles, the 
advantage is generally lost. 
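
A minimal sketch of the "family profile of amino acid frequencies at each alignment position" follows. The function name, the treatment of gaps, and the toy alignment are assumptions for illustration; the published method takes its alignments from the HSSP data base and may weight sequences differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def family_profile(aligned_sequences):
    """Return one list of 20 residue frequencies per alignment column.

    Gap characters and any symbol outside AMINO_ACIDS are ignored
    (an assumption; the original treatment of gaps is not given here).
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_sequences
                         if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[aa] / total if total else 0.0
                        for aa in AMINO_ACIDS])
    return profile


# Hypothetical four-sequence alignment, not taken from the article.
toy_alignment = ["AFDG", "AYDG", "SFEG", "AF-G"]
for column in family_profile(toy_alignment):
    print({aa: round(f, 2) for aa, f in zip(AMINO_ACIDS, column) if f > 0})
```

Each column of the profile, rather than a single residue identity, would then be presented to the network for the corresponding window position.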

Balanced Training 

Most secondary-structure prediction methods have been
optimized exclusively to yield a high overall accuracy. This
method can lead to severe artifacts because of the very
uneven distribution of secondary-structure types in globular
proteins: 32% α-helix, 21% β-strand, and 47% loop (our data
base). Usually, loops are predicted quite well, helices are
predicted medium well, and strands are predicted rather
poorly. This imbalance can be corrected if the network is

Fig. 1. Network architecture. A sequence profile of a protein family, rather than just a single sequence, is used as input to a neural network
for structure prediction. Each sequence position is represented by the amino acid-residue frequencies derived from multiple sequence alignments
as taken from the homology-derived structure of proteins (HSSP) data base (12). The residue frequencies for the 20-residue types are represented
by 3 bits each (or by one real number). To code the N- and C-terminal ends an additional 3 bits are required (or one real number). The 63 bits
originating from one sequence position are mapped onto 63 (21 for real numbers) input units of the neural network. A window of 13 sequence
positions, thus, corresponds to 819 (273) input units. The input signal is propagated through a network with one input, one hidden, and one output
layer. The output layer has three units corresponding to the three secondary-structure states, helix, β-strand, and loop, at the central position
of the input sequence window. Output values are between 0 and 1. The experimentally observed secondary structure states (19) are encoded
as 1,0,0 for helix; 0,1,0 for strand; and 0,0,1 for loop. The error function to be minimized in training is the sum over the squared difference between
current output and target output values. Net cascade: the first network (sequence-to-structure) is followed by a second network (structure-
to-structure) to learn structural context (not shown). Input to the second network is the three output real numbers for helix, strand, and loop
from the first network, plus a fourth spacer unit, for each position in a 17-residue window. From the 17 x (3 + 1) = 68 input nodes the signal
is propagated via a hidden layer to three output nodes for helix, strand, and loop, as in the first network. In prediction mode, a 13-residue sequence
window is presented to the network, and the secondary-structure state of the central residue is chosen, according to the output unit with the
largest signal.



trained with each type of secondary structure in equal
proportion (33%), rather than in the proportion present in the
data base or anticipated in the proteins to be predicted. The
result is a more balanced prediction (Fig. 3; Table 1), without
affecting, negatively or positively, the overall three-state
accuracy. A similar result was reported by Hayward and
Collins (22). The main improvement is in a better β-strand
prediction, the most difficult of the three states to predict.
The method maintains full generality; i.e., it is equally
applicable to all-α, mixed αβ, and all-β proteins. No knowl-
edge of the structural type of the protein is required, as is the
case for methods optimized on particular structural classes
(23).
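
One way to present each secondary-structure type in equal proportion during training is sketched below. This is an assumed sampling scheme, not necessarily the one used by the authors: the state is drawn uniformly first, then a training window of that state is drawn at random. The function name and data layout are hypothetical.

```python
import random


def balanced_batch(examples, batch_size, rng=None):
    """Draw training examples so that H, E and L appear about equally often.

    examples is a list of (window, state) pairs with state in 'H', 'E', 'L'.
    The sampling scheme is an illustrative assumption.
    """
    rng = rng or random.Random(0)
    by_state = {"H": [], "E": [], "L": []}
    for window, state in examples:
        by_state[state].append((window, state))
    states = [s for s in by_state if by_state[s]]
    return [rng.choice(by_state[rng.choice(states)]) for _ in range(batch_size)]


# Toy data set with the 32% / 21% / 47% composition quoted in the text.
toy = [("w", "H")] * 32 + [("w", "E")] * 21 + [("w", "L")] * 47
batch = balanced_batch(toy, 3000)
for state in "HEL":
    share = sum(1 for _, s in batch if s == state) / len(batch)
    print(state, round(share, 2))
```

With this scheme the rarer strand examples are simply revisited more often than loop examples, so no training data are discarded.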

Training on Structural Context

Even if a prediction method has high overall accuracy and is
well balanced, it can be woefully inadequate in the length
distribution of the predicted helices and strands. For exam-
ple, the reference network predicts too many short strands
and helices and too few long ones (Fig. 4). The predictions of
this network appear fragmented compared with typical glob-
ular proteins. Published prediction methods have similar



shortcomings in the length distribution of segments to various 
extents, except for two methods that optimize the sum of 
segment scores by dynamic programming (W. Kabsch and 
C.S., personal communication and ref. 24). The shortcoming 
is partly overcome here by feeding the three-state prediction 
output of the first, "sequence-to-structure," network, into a 
second, "structure-to-structure," network. The second net-
work is trained to recognize the structural context of single-
residue states, without reference to sequence information. 
Training it is very similar to that used for the sequence-to- 
structure network. The output string of the first network— 
e.g., the partially incorrect string HHHEEHH (two β-strand
residues in the middle of a helix)— becomes the input to the 
second network and is confronted with correct structure 
HHHHHHH, a helical segment. Network couplings are 
optimized to minimize the discrepancy. The addition of the 
structure-structure network increases the overall accuracy 
only marginally but reproduces substantially better the length 
distribution of helices and strands. A simple way of measur- 
ing the quality of segment lengths is to compare the average 
length of helices and strands in the data base to those in the 
predicted structures (⟨Lα⟩ = 6.9, ⟨Lβ⟩ = 4.6; Fig. 4). A similar
second-level network was used by Qian and Sejnowski (5),
but no effect of improved prediction of segment lengths was
reported.

[Figure 2 legend: training set; test set; protein family; filled circle, known 3-D structure; open circle, homologue; distance = % of non-identical residues.]

Fig. 2. Partition of protein families into training and test set. The
structurally known representatives of the families used for training
the network have a distance of at least 75% to those used for testing
(sequence distance in percent nonidentical residues; drawn sche-
matically). Each family contains homologous sequences, defined as
those [illegible] dimensional.
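
The segment-length measure used in this section (average length of helices and strands) can be computed from a string of per-residue states as sketched here. The function name is hypothetical; the example strings are the ones quoted in the text for the structure-to-structure network.

```python
from itertools import groupby


def mean_segment_length(states, symbol):
    """Average length of uninterrupted runs of `symbol` in a state string.

    Returns 0.0 when the symbol never occurs.
    """
    lengths = [len(list(run)) for state, run in groupby(states) if state == symbol]
    return sum(lengths) / len(lengths) if lengths else 0.0


first_level_output = "HHHEEHH"   # partially incorrect string from the text
observed_structure = "HHHHHHH"   # the correct helical segment

for label, string in (("first level", first_level_output),
                      ("observed", observed_structure)):
    print(label,
          "helix:", mean_segment_length(string, "H"),
          "strand:", mean_segment_length(string, "E"))
```

A fragmented prediction shows up as a short average helix length even when most residues are individually correct, which is why the measure complements per-residue accuracy.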

"Jury of Networks" 

An additional two percentage points in overall accuracy were 
gained by a jury of networks that predicts by simple majority 
vote of a set of 12 different networks. The increased accuracy 
is an effect of noise reduction, mitigating the ill effects of 
incomplete optimization when any single network settles into 
a local minimum of the error function. 
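
A simple majority vote over several networks can be sketched as below. The function name and the tie-breaking rule (the state that appears first among the votes wins a tie) are assumptions; the text states only that the jury predicts by simple majority vote of 12 networks.

```python
from collections import Counter


def jury_predict(per_network_predictions):
    """Combine per-residue state strings from several networks by majority vote.

    per_network_predictions is a list of equal-length strings over 'H', 'E', 'L'.
    Ties go to the state encountered first among the votes (an assumption).
    """
    length = len(per_network_predictions[0])
    if any(len(p) != length for p in per_network_predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for votes in zip(*per_network_predictions):
        consensus.append(Counter(votes).most_common(1)[0][0])
    return "".join(consensus)


# Three hypothetical networks instead of twelve, to keep the example short.
print(jury_predict(["HHEL", "HEEL", "HHLL"]))
```

Because each network tends to err at different positions, isolated mistakes are outvoted, which is the noise-reduction effect described above.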



Overall Improvement

The final jury of networks outperforms all known methods in
overall accuracy, balanced β-strand prediction, and length
distribution of segments as follows.
(i) The overall accuracy is 69.7%, three percentage points
above the highest value reported so far [66.4% (4)]. The
actual improvement may be larger, as their test set has
sequence similarities of up to 46% relative to the training set.
The improvement is six percentage points relative to the best
classical method tested on our data base [63.4%, alb (20)].
For a new protein sequence, one can expect a prediction
accuracy between 61% and 79% (1 SD about the average over

Table 1. Observed versus predicted matrix for best method of Fig 



Fig. 3. Testing five secondary-structure prediction methods on
the same set of proteins reveals the contribution of different devices
to the improvement of accuracy. Q[illegible], overall prediction accuracy
for the three states (helix, strand, loop; number of residues predicted
correctly divided by the total number of residues). Qhelix and Qstrand,
prediction accuracy calculated separately for helix and strand (e.g.,
number of helix residues predicted correctly divided by number of
observed helix residues). The methods tested on our data base are:
[illegible], first-level network with no balanced learning and no
profiles (reference net), a two-level network cascade with balanced
[illegible]
with profiles and balanced learning (net with profiles), and 12
[illegible] networks combined by majority vote (jury of 12). [illegible]
groups achieve higher accuracy than does alb, but accuracy
values are not strictly comparable, as they are based on differ-
ent test data sets and, in part, on test proteins with detectable
sequence similarities to proteins on which the [illegible].
Values for (Qhelix, Qstrand, Qloop) are [illegible]5% (65, 45, [illegible]), combine
(2); 63.0% (58, 54, 68), simpa (1); and 66.4%, Zhang et al. (4).
Observed versus predicted matrix for the best method is indicated in
Table 1.
individual proteins of 70.2%), provided several homologous
sequences are available. Values for three-state accuracy
should not be confused with those for two-state accuracy (9,
23). Two-state predictions (e.g., for the state helix/
nonhelix) carry less information and have a base value for
random prediction of 50%, i.e., 17 percentage points higher
[text missing]

[text missing] of correct prediction, given a residue predicted in a particular
state, are 72% helix and 57% strand.

(iii) The length distribution of segments is more "protein-
like" (Fig. 4). Unfortunately, the length distribution is not

[Table 1 fragment: "Correlation"; 5552, 517, 1548, 7617. Row and column labels were lost in extraction.]

[Figure 4 axis label: length of strand segments]

Fig. 4. Deviation in the length distribution of observed and
predicted segments is an additional criterion by which prediction
methods can be evaluated. (a) Difference in the length distribution of
helix segments, i.e., number of observed segments in a given length
range minus number of predicted segments. (b) Difference in the
length distribution of strand segments. Predictions by the simple net
(no profile, not balanced, no cascade) result in too many short
segments, too few long segments; prediction by the jury of 12 nets
results in a length distribution much closer to the observed one.
Average segment lengths are as follows: reference net, ⟨Lα⟩ = 4.2 and
⟨Lβ⟩ = 2.9 residues; jury of 12, ⟨Lα⟩ = 8.9 and ⟨Lβ⟩ = 5.1 (observed:
⟨Lα⟩ = 9.0 and ⟨Lβ⟩ = 5.1).



generally given in the literature, but most methods are
inferior in this regard.



Tests on Completely New Proteins 

How accurate are predictions likely to be in practice? As a
final check, the network system was trained on the full set of
151 sequence families of known structure and then tested on
26 protein families for which a first x-ray or NMR three-
dimensional structure became available after the network
architecture had been finalized. None of these additional test
proteins had >25% sequence identity relative to any of the 
training proteins (Fig. 5). In this final set, 72% of the observed 
helical and 68% of strand residues were predicted correctly. 
The overall three-state accuracy for this set of completely 
new protein structures was 70.3%. 
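
The accuracy figures quoted here follow the definitions given in the Fig. 3 caption: residues predicted correctly divided by all residues (three-state accuracy), or by the residues observed in one state (per-state accuracy). A minimal sketch, with hypothetical function names and a made-up ten-residue example, is:

```python
def three_state_accuracy(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed, predicted, state):
    """Percentage of residues observed in `state` that are predicted correctly.

    Returns None when no residue is observed in that state.
    """
    predictions = [p for o, p in zip(observed, predicted) if o == state]
    if not predictions:
        return None
    return 100.0 * sum(p == state for p in predictions) / len(predictions)


# Made-up strings for illustration (H helix, E strand, L loop).
observed = "HHHHEEELLL"
predicted = "HHHLEELLLL"
print(three_state_accuracy(observed, predicted))
print(per_state_accuracy(observed, predicted, "H"))
print(per_state_accuracy(observed, predicted, "E"))
```

Averaging `three_state_accuracy` over all residues of a test set and averaging it protein by protein generally give different numbers, which is why both kinds of average are reported for the new proteins.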

Predictions via Electronic Mail 

Secondary-structure predictions using the currently best 
version of the profile network from Heidelberg (phd) are 
available via electronic mail. Send a message containing the 
word "help" to PredictProtein@EMBL-Heidelberg.de. In 
practice, the predictions give a good first hypothesis of the 
structural properties of any newly sequenced water-soluble 
protein and may be an aid in the planning of point-mutation 
experiments and in the prediction of tertiary structure. 

Conclusion 

There are two important practical limitations: most of the 
advantage of the current method is lost when no sequence 
homologues are available; and the method in its current 
implementation is not valid for membrane proteins and other 
nonglobular or non-water-soluble proteins.

A major limitation in principle of the current method lies in 
its limited goal: secondary structure is a very reduced de- 
scription of the complexities of three-dimensional structure 
and carries little information about protein function. How- 
ever, as long as reliable prediction methods for protein 
three-dimensional structure and function are not available, 
secondary-structure predictions of improved quality are use- 
ful in practice — e.g., for the planning of point-mutation 
experiments, for the selection of antigenic peptides, or for 
identification of the structural class of a protein. Indeed, 
interest in the community is substantial: during 6 mo. since 
submission of this manuscript, >3,000 predictions for a wide 
variety of sequences have been requested and served via 
electronic mail. 

Looking ahead, we would not be surprised to see increas- 
ingly successful use of evolutionary information in attempts 
to predict more complex aspects of protein structure and 
function. Sequence families grouped around one structure as 
well as structural superfamilies with common folds but di- 
vergent sequences (26, 27) contain a wealth of information 
not available 14 yr ago at the time of the first attempts at using 
homologous sequences for improved prediction (16). Having 
posed the puzzle of protein folding, evolution may hand us
the key to its successful solution. 

Note Added in Proof. Since the submission of this paper (April 1993) 
the method described has been improved further. By explicitly using 



number 1 2 3 4 5 6 7 8 

sequence AFDGTOKVDRNENYEKFMEKMGINWKRKLGAHDNLKLTITQEGNKFTVKESSNFRNIDWFELGVDFAYSLADGTELTG 

observed EEEEEEEEE HHHHHHH HHHHHHH EEEEEEE EEEEEEE EEEEEEE EEEE EEEE 

predicted EEEE HHHHHHHHHHHHHHHHHHHH EEEEEE EEEEEEE EEEEEEEEEE EHHEE EE 

number 9 0 X 2 3. 

sequence TWTMEGNKLVGKFKRVDNGKELIAVREISGNELIQTYTYEGVEAKRIFKKE 

observed EEEEE EEEEEEEE EEEEEEEEE EEEEEEEE EEEEEEEEE 

predicted EEEE HHEEEEEE HHHHHHHHH EEEEEEE EEEEEEE 

Fig. 5. Example of prediction for a protein sequence by the currently best method. The β-barrel structure of intestinal fatty acid-binding
protein has just become available through Protein Data Bank [code 1ifb (25)]. Prediction accuracy is 71.8%. In this β-sandwich structure, 8 out
of the 10 β-strands are predicted correctly (one strand is ambiguous, and one strand is predicted as helix, but the ends of the segment are correct),
and the two helices are predicted as one long helix (E: strand, H: helix). For all 26 new protein chains, including 1ifb, overall accuracy averaged
over single residues is 70.3%; averaged over single proteins, it is 71.1%. The estimated probabilities of correct prediction, given a residue
predicted in a helix, strand, or loop, were 69%, 58%, or 77%, respectively (see text for probabilities relative to the number of residues observed
in the three states). These 26 protein chains were not available publicly at the time of development of the method and were only used once in
a final test of the currently best method. They are as follows: 1ace, 1cox, 1cpk_E, 1dfn_B, 5enl, lOg, 3fgf, 2gbl, 1gly, lgmLA, 1hcc, 1hdd_C,
2hip_B, 1ifb, 1msb_A, 1nsb_B, 5p21, 1pi2, 2pk4, 1rop_A, 1sar_A, 2scp_A, 1snv, 3trx, 3znf, 2zta_A (all taken from the Protein Data Bank prerelease
of July 1992; membrane proteins and proteins with many metals or SS bridges were not considered).

conservation weights and the numbers of insertions and deletions in
the multiple sequence alignments as input to the network system, the
sustained overall three-state accuracy becomes 71.4% on the same
data set used in this paper.

ABSTRACT Protein structural flexibility
is important for catalysis, binding, and al- 
lostery. Flexibility has been predicted from 
amino acid sequence with a sliding window av- 
eraging technique and applied primarily to 
epitope search. New prediction parameters 
were derived from 92 refined protein structures 
in an unbiased selection of the Protein Data 
Bank by developing further the method of Kar- 
plus and Schulz (Naturwissenschaften 72:212- 
213, 1985). The accuracy of four flexibility pre- 
diction techniques was studied by comparing 
atomic temperature factors of known three- 
dimensional protein structures to predictions 
by using correlation coefficients. The size of 
the prediction window was optimized for each 
method. Predictions made with our new param-
eters, using an optimized window size of 9 res-
idues in the prediction window, gave the
best results. The difference from another previ-
ously used technique was small, whereas two
other methods were much poorer. Applicability 
of the predictions was also tested by searching 
for known epitopes from amino acid sequences.
The best techniques predicted correctly 20 of 31 
continuous epitopes in seven proteins. Flexibil- 
ity parameters have previously been used for 
calculating protein average flexibility indices 
which are inversely correlated to protein stabil-
ity. Indices with the new parameters showed
better correlation to protein stability than those
used previously; furthermore, they showed a relation-
ship even when the old parameters failed.

© 1994 Wiley-Liss, Inc. 

Key words: dynamics, flexibility index, protein 
stability, antigenic regions, 
epitopes 

INTRODUCTION 

Protein molecules are dynamic, being in constant
motion. Structural flexibility is essential for activ-
ity but, on the other hand, structural stability re-
quires rigidity. 1,3 Flexible regions are found in cat-
alytic sites, 4-7 binding sites, 8 antigenic regions, 9
sites susceptible for proteolytic cleavage, 10 allosteric
hinge sites, 11 etc. Proteins with similar functions
have similar excess of flexibility in their optimum
reaction conditions. 2-4




The core of a globular protein is relatively tightly
packed. Surface residues are generally more mobile
due to fewer stabilizing interactions. Exposed sur-
face loops are the most flexible and show the largest
sequence variation. The time scale of protein mobil-
ity is very wide, the fastest vibrations and motions
requiring only 10^-14 to 10^-13 s. Mobility can be sim-
ulated with molecular dynamics. Although the sim-
ulations are relatively short, plenty of valuable in-
formation is available. The flexible regions can be
predicted using less accurate methods even without
structural information.

Three techniques have been used for predicting
protein flexibility from amino acid sequence. The
methods of Karplus and Schulz12 (KS) and of
Bhaskaran and Ponnuswamy13 (BP) are based on
parameters derived from three-dimensional structures.
Ragone et al.14 (R) base their approach on a
combination of hydropathy predictions and amino
acid volumes. Flexibility analysis can be used to
search for the most mobile and thus possibly also
the surface residues in a sequence, which are
thought to represent epitopes. For vaccine production,
it would be of great value to be able to predict
the antigenic regions of a protein from its sequence.
Flexibility predictions have been used in
searching for continuous epitopes from amino acid
sequences.12,14,15 Other epitope prediction methods
include hydropathy,16,17 β-turn propensity,18 and
joint prediction of hydropathy, surface accessibility,
flexibility, and secondary structure.19 Stein20 has
recently reviewed the methods.

The KS method uses normalized B-values of Cα
atoms in 31 protein structures. Here we have extended
the flexibility prediction by analyzing all the
backbone atoms of 92 well refined structures from
an unbiased selection of PDB.21 To test the applicability
of the predictions they were compared to experimental
B-values. Their use in predicting antigenic
regions was studied with proteins for which
locations of continuous epitopes have been determined.
Increased hydrophobicity and decreased
flexibility have been shown to be the main stabilizing
principles in thermostable proteins.22 Previously
we have shown inverse correlation between
thermal stability and structural flexibility by calculating
flexibility indices from normalized B-values
for amino acid sequences.2 An even better and more
accurate correlation to stability was noticed when
flexibility indices were calculated with the new
parameters.

Abbreviations: BP, flexibility according to Bhaskaran and
Ponnuswamy; KS, according to Karplus and Schulz; R, according
to Ragone et al.; VTR, according to parameters derived
here; PDB, Protein Data Bank.

METHODS 

Entries for high resolution structures containing
B-values were taken from the unbiased selection of
Protein Data Bank21 (PDB) because there are a lot
of redundant data in PDB. Only 78 of the original
102 entries could be used in this analysis because of
missing or incomplete B-values or sequence information.
Some of the chosen entries contained several
proteins, so finally there were 92 different
structures. The PDB entries were 1bp2, 1ccr, 1cla,
1cse (2 chains), 1ctf, 1eca, 1fc2 (2 chains), 1gcr, 1gd1,
1gox, 1gp1, 1hoe, 1i1b, 1ldm, 1lz1, 1mbd, 1nxb, 1pcy,
1phh, 1prc (4 chains), 1r69, 1sgt, 1sn3, 1tnf, 1ubq,
1utg, 1wsy (2 chains), 2aza, 2cab, 2ccy, 2cdv, 2ci2,
2cpp, 2cts, 2cyp, 2fb4 (2 chains), 2gbp, 2gn5, 2hhb (2
chains), 2hla (2 chains), 2hmg (2 chains), 2lbp, 2lh2,
2ltn (2 chains), 2mhr, 2ovo, 2paz, 2pfk, 2rnt, 2rsp,
2sga, 2sod, 2ts1, 2wrp, 3adk, 3gap, 3grs, 3ins (2
chains), 3lzm, 3rn3, 3tln, 451c, 4cha, 4fd1, 4hvp,
4pep, 4xia, 5at1 (2 chains), 5cpa, 5pti, 5rxn, 6acn,
7api (2 chains), 8adh, 8cat, 8dfr, 9pap, 9wga.
Normalized B-values derived from the unbiased structures
were used both for flexibility prediction and
the calculation of flexibility indices. Computer programs
were developed to be compatible with the
GCG program suite.23

Calculation of Normalized B-Values

The selection of 92 unbiased protein structures
was used to derive normalized B-values. Temperature
factors of the backbone atoms N, Cα, C, and O
were taken from the PDB.24 The Karplus and
Schulz12 approach of determining normalized B-values
was repeated with our extended database. The
threshold values are those previously used. The
B-values of each protein were normalized so that the
mean was 1.0 and the root mean square deviation
0.3. Based on its deviation from the mean, each residue
type was defined as flexible or rigid. Those with
average B norm values below 1.0 were denoted as
rigid. In the next step normalized B-values were
determined for each residue type when surrounded by
none, one or two rigid neighbours to obtain B norm0,
B norm1, and B norm2 tables, respectively. Because
chain termini are usually very flexible and could
have caused bias, three N- and C-terminal residues
were omitted from each structure.
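
The normalization step can be illustrated with a short Python sketch. It assumes the rescaling is linear (shift and scale so that the per-protein mean becomes 1.0 and the root mean square deviation 0.3); the function names and the example values are hypothetical and are not taken from the programs described here.

```python
import math


def normalize_b_values(b_values, target_mean=1.0, target_rms=0.3):
    """Linearly rescale B-values to a given mean and RMS deviation.

    Assumption: a linear rescaling is used; names are illustrative only.
    """
    n = len(b_values)
    mean = sum(b_values) / n
    rms = math.sqrt(sum((b - mean) ** 2 for b in b_values) / n)
    if rms == 0:
        # All B-values identical: every residue gets the target mean.
        return [target_mean] * n
    return [target_mean + target_rms * (b - mean) / rms for b in b_values]


def trim_termini(values, n_trim=3):
    """Drop n_trim residues from each chain end before deriving parameters."""
    return values[n_trim:len(values) - n_trim]


# Hypothetical backbone-averaged B-values for a ten-residue chain.
raw = [40.0, 35.0, 30.0, 12.0, 10.0, 11.0, 14.0, 28.0, 33.0, 45.0]
normalized = normalize_b_values(trim_termini(raw))
```

Whether the termini are removed before or after normalization is not stated in the text; the order used in the usage lines is an assumption.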

Programs for Flexibility Prediction 

Program FLEX was implemented for flexibility
predictions with our new B norm tables and with the
parameters of Bhaskaran and Ponnuswamy13 and
Ragone et al.14 The antigenic index of Jameson and
Wolf19 was also included. The sequence can be read
either from a PDB or a GCG file. The predictions are
based on a sliding window averaging technique. The
optimized window size for each technique is used:
five for R, seven for BP, and nine residues for our
parameters and those of KS. The propensities for the
residues inside the window are summed up and
given for the residue in the middle of the window.
The weighting of residues inside the window is 0.25,
0.4375, 0.625, 0.8125, 1, 0.8125, 0.625, 0.4375, and
0.25 from left to right in techniques using B norm
values but has a constant value of 1 in the methods
of BP and R. The flexibilities of KS and ours are
calculated as follows. First the number of rigid
neighbors around each residue is determined. Then
the neighbor correlated weighted propensities from
B norm tables are summed and given for the middlemost
residue, after which the window is shifted by
one residue. The results can be presented with the
program FLEXPLOT on several graphics devices.
Experimental B-values are shown for the backbone
atoms of proteins in PDB entries.
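
As an illustration of the procedure just described, the following Python sketch computes a nine-residue weighted profile. It is not the FLEX program. The parameter triples are the B norm0, B norm1 and B norm2 values as read from Table I; the function names, the treatment of chain ends (a missing neighbor counts as non-rigid) and the choice to report only positions with a complete window are assumptions made for the example.

```python
# (B norm0, B norm1, B norm2) per residue type, as read from Table I.
VTR = {
    "W": (1.186, 0.938, 0.796), "C": (1.196, 0.939, 0.785),
    "F": (1.247, 0.934, 0.774), "I": (1.241, 0.977, 0.776),
    "Y": (1.199, 0.981, 0.788), "V": (1.235, 0.968, 0.781),
    "L": (1.234, 0.982, 0.783), "H": (1.279, 0.967, 0.777),
    "M": (1.269, 0.963, 0.806), "A": (1.315, 0.994, 0.783),
    "T": (1.324, 0.998, 0.795), "R": (1.310, 1.026, 0.807),
    "G": (1.382, 1.018, 0.784), "Q": (1.342, 1.041, 0.817),
    "S": (1.381, 1.025, 0.811), "N": (1.380, 1.022, 0.799),
    "P": (1.342, 1.050, 0.809), "D": (1.372, 1.022, 0.822),
    "E": (1.376, 1.052, 0.826), "K": (1.367, 1.029, 0.834),
}
# Residue types whose average B norm is below 1.0 are treated as rigid.
RIGID = set("WCFIYVLHMAT")
WEIGHTS = (0.25, 0.4375, 0.625, 0.8125, 1.0, 0.8125, 0.625, 0.4375, 0.25)


def neighbor_values(seq):
    """Pick the table value by the number of rigid sequence neighbors (0-2)."""
    values = []
    for i, res in enumerate(seq):
        left = i > 0 and seq[i - 1] in RIGID
        right = i < len(seq) - 1 and seq[i + 1] in RIGID
        values.append(VTR[res][int(left) + int(right)])
    return values


def predict_flexibility(seq, weights=WEIGHTS):
    """Weighted window sum assigned to the middle residue (0-based index)."""
    values = neighbor_values(seq)
    half = len(weights) // 2
    profile = {}
    for mid in range(half, len(seq) - half):
        window = values[mid - half:mid + half + 1]
        profile[mid] = sum(w * v for w, v in zip(weights, window))
    return profile
```

Dividing each sum by the total weight (5.25 for this window) would put the profile on the same scale as the B norm values; the text only says that the propensities are summed, so the sketch leaves the sum unscaled.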

Testing Accuracy of the 
Flexibility Predictions 

The accuracy of the different flexibility prediction
methods was studied by determining correlation
coefficients. The B-values for each of the proteins were
compared to predicted flexibilities by calculating
correlation coefficients. Many PDB structures contain
one or just a few highly flexible residues due to,
e.g., lattice disorders. To see if the high peaks might
bias the analysis, the B-values of residues in each
protein were scaled from 0 to 100%. If only one residue
had flexibility higher than 80 or 90%, its value
was reduced to that of the second highest residue
and the analysis was repeated until there were residues
also on intervals 80 to 90% and/or 90 to 100%.
The correlation coefficients were determined for the
entries both when smoothed by sieving the high
peaks and when untreated.
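
The comparison can be sketched in Python as follows. The correlation is the ordinary Pearson coefficient. The sieving routine is only one possible reading of the description above: it clips a peak when exactly one residue lies above 80% of the scaled range, and the single threshold is an assumption. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sieve_peaks(b_values, threshold=80.0, max_rounds=100):
    """Clip an isolated top peak to the second highest value.

    Assumed reading: clip while exactly one residue exceeds `threshold`
    percent of the 0-100% scaled range.
    """
    values = list(b_values)
    for _ in range(max_rounds):
        lo, hi = min(values), max(values)
        if hi == lo:
            break
        scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
        above = [i for i, s in enumerate(scaled) if s > threshold]
        if len(above) != 1:
            break
        values[above[0]] = sorted(values)[-2]
    return values


# Hypothetical experimental B-values and predicted flexibilities.
observed = [10.0, 12.0, 11.0, 50.0]
predicted = [0.95, 1.02, 0.99, 1.10]
r_untreated = pearson(observed, predicted)
r_sieved = pearson(sieve_peaks(observed), predicted)
```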

Optimization of the Flexibility
Prediction Techniques

The flexibility prediction techniques use the sliding
window averaging technique. The only adjustable
parameter in the R and BP methods is the
width of the window, i.e., number of consecutive residues
used in the prediction at a time. In the method
of KS the window was originally fixed to seven residues
but the residues inside the window had different
weighting depending on their location within
the window. The length of the window was optimized
for all four prediction techniques by determining
correlation coefficients and maximizing information
contents with window lengths 5 to 15
residues. In addition, the effect of residue
weighting was studied by giving the weight of 0.25
for the first and last residues in the window and 1.0
for the middlemost. The weights of the others were
at equal spacing between those two values.
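
The weighting scheme can be written as a small Python helper. This is a sketch with a hypothetical function name; it assumes odd window lengths so that a middle residue exists.

```python
def window_weights(length, edge=0.25, center=1.0):
    """Weights rising linearly from `edge` at both ends to `center` in the middle."""
    if length < 3 or length % 2 == 0:
        raise ValueError("window length must be an odd number >= 3")
    half = length // 2
    step = (center - edge) / half
    rising = [edge + step * k for k in range(half + 1)]
    # Mirror the rising half, without repeating the middle weight.
    return rising + rising[-2::-1]
```

For a window of nine residues this gives 0.25, 0.4375, 0.625, 0.8125, 1, 0.8125, 0.625, 0.4375 and 0.25, i.e., the weights listed for the B norm based techniques; for seven residues it gives 0.25, 0.5, 0.75, 1, 0.75, 0.5 and 0.25.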

Calculation of Flexibility Indices

Atomic temperature factors (B-values) obtained
during crystal structure determination are a measure
of the flexibility of the residues in the protein.
We have used normalized B-values to calculate average
flexibility indices for the whole protein molecule.
Since the flexibility of a residue is dependent
on the nature of neighboring residues, three parameter
tables are used. There have been two ways to
calculate average flexibility indices.2 The F index is
calculated from

\[ F = \frac{1}{n-2} \sum_{i=2}^{n-1} B_{nc,i} \]

where $n$ is the number of residues and $B_{nc,i}$ is the neighbor
correlated normalized B-value for the residue type at position $i$.
Another equation, $F_7$, gives different emphasis for
the chain termini

\[ F_7 = \frac{1}{n-14} \sum_{i=8}^{n-7} f_i \]

where

\[ f_i = \left[ B_{nc,i} + 0.75\,(B_{nc,i-1} + B_{nc,i+1}) + 0.5\,(B_{nc,i-2} + B_{nc,i+2}) + 0.25\,(B_{nc,i-3} + B_{nc,i+3}) \right] / 4. \]

Now that window size nine was found to be optimal
in predictions with normalized B-values, a new equation
was determined

\[ F_9 = \frac{1}{n-18} \sum_{i=10}^{n-9} f_i \]

where

\[ f_i = \left[ B_{nc,i} + 0.8125\,(B_{nc,i-1} + B_{nc,i+1}) + 0.625\,(B_{nc,i-2} + B_{nc,i+2}) + 0.4375\,(B_{nc,i-3} + B_{nc,i+3}) + 0.25\,(B_{nc,i-4} + B_{nc,i+4}) \right] / 5.25. \]
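
A compact Python sketch of the three indices follows. It assumes each index is the mean of the (optionally smoothed) neighbor correlated values over the summation range given above, i.e., residues 2 to n-1 for F, 8 to n-7 for F 7 and 10 to n-9 for F 9; this averaging and all names are assumptions made for illustration.

```python
def smoothed(b_nc, i, half):
    """Weighted mean around position i; weights fall linearly from 1 to 0.25."""
    if half == 0:
        return b_nc[i]
    step = 0.75 / half
    total, weight_sum = b_nc[i], 1.0
    for k in range(1, half + 1):
        w = 1.0 - step * k
        total += w * (b_nc[i - k] + b_nc[i + k])
        weight_sum += 2.0 * w
    return total / weight_sum  # weight_sum is 4 for half=3, 5.25 for half=4


def flexibility_index(b_nc, half, margin):
    """Average of smoothed values, skipping `margin` residues at each terminus."""
    positions = range(margin, len(b_nc) - margin)
    if len(positions) == 0:
        raise ValueError("sequence too short for this index")
    return sum(smoothed(b_nc, i, half) for i in positions) / len(positions)


def f_index(b_nc):
    return flexibility_index(b_nc, half=0, margin=1)


def f7_index(b_nc):
    return flexibility_index(b_nc, half=3, margin=7)


def f9_index(b_nc):
    return flexibility_index(b_nc, half=4, margin=9)
```

Here `b_nc` is the list of neighbor correlated normalized B-values along the sequence, one value per residue.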

RESULTS AND DISCUSSION 
New Flexibility Parameters 

Three methods have been used to predict protein
structural flexibility from sequences.12-14 The parameters
for the KS and BP techniques were derived
from known 3D structures, whereas those for
the R technique are combined from other predictions.
A limited set of 31 structures was used in the
KS method to determine prediction parameters,
whereas BP had only 19 proteins. We have extended
the analysis to 92 refined structures. We reimplemented
the KS algorithm because we found it gave
the most accurate predictions. All these techniques
use a sliding window averaging technique; parameters
are summed for a stretch of amino acids within
a window which is shifted by one residue at a time.
In the KS method residues have coefficients dependent
on the location within the window, thus the
contribution of a residue to the prediction value depends
on its distance from the middle of the window.
Claverie and Daulmerie25 argue that smoothing of
the prediction curves by weighting is advantageous,
since pattern recognition is easier and the irregular
variation of values is damped. The smoothing is also
better for detecting local maxima, which are of importance,
e.g., in epitope analysis. von Heijne26 has
used a related trapezoid weighting scheme in analysis
of membrane spanning segments.

The proteins for calculating normalized B-values
were taken from an unbiased selection of the PDB.21
Normalized B-values, hereafter VTR parameters,
were calculated from the 92 structures (Table I). The
major difference to those of KS is that we have 11
rigid residues instead of 10. Threonine is classified
as a rigid amino acid, because its average B norm is
below 1. The order and values of residues have
changed. Glycine is generally considered to be the
most flexible amino acid. It has the highest value
both in BP and R tables but not in the KS table. In
our analysis it is found to be flexible but there are
still seven more flexible residue types. This might
be because the more flexible residues, which are all
charged or polar except for proline, appear mainly
on the surface whereas glycine is also found in the
protein interior. As the normalized B-values are averages,
the restricted mobilities of buried glycine residues
may reduce the overall value. Another
explanation might be frequent occurrence in tight
turns having restricted mobility. The values for glycine
are the most neighbour dependent. When surrounded
by one or two rigid residues it is among the
most flexible residues.

The new B norm values were used in flexibility prediction.
If the sequence in program FLEX is read
from a PDB file, B-values are averaged for the backbone
atoms. The predictions and the B-values are
presented with program PLOTFLEX. For the plots
the values of BP and R tables were normalized to be
from 0 to 1. In the original R parameterization the
most flexible residue had the lowest value, thus the
numbers were inverted to be comparable to the others.

Accuracy of the Flexibility 
Prediction Techniques 

The accuracy of the techniques was tested with the
correlation coefficient method. The prediction window
was adjusted from 5 to 15 residues and the prediction
accuracy was followed when the highest
B-value peaks were either smoothed or not. The
means of the correlation coefficients over the 92 proteins
in Table II show that the optimal window in the
BP method was seven residues and five in R. We can
see that the R method is the overall poorest technique.
This can be understood from the origin of the
parameters, which were obtained by multiplying residue
hydrophobicities by volumes without using any
structural analysis. The three methods giving better
correlation coefficients are based on three-dimensional
structures.

The optimization of the predictions with the VTR
and KS parameters required weighting of the residues
in the window. The window sizes tested were
from 5 to 15 residues. The weights of the outermost
residues were 0.25 and 1 for that in the middle. The
value 0.25 was chosen to give some emphasis also for
the ends of windows. The optimal window size was
nine for both VTR and KS techniques (Table II), the
latter of which previously used seven residues. The
correlation coefficients with the optimized prediction
techniques for all the 92 protein structures with
all the four techniques are as follows: VTR 0.3304,
KS 0.3356, BP 0.2428, and R 0.1659. Clearly the
best results were obtained with the VTR and KS
methods, the other two being much poorer. The predictive
power varies greatly in each technique. The
best correlation coefficients are close to 0.8 whereas
the poorest values are close to 0. All the tested methods
predicted some proteins poorly. The results show
that in the KS method the previously used window
of 7 residues is not optimal. The best results are
obtained with nine consecutive amino acids. The
new parameters were better than those of KS with
short (5 to 7 residues) and large (15 residues) windows
but the differences are not significant.

Many proteins have one or only a few residues
with very large B-values due to, e.g., static disorders
in crystal lattice. These residues produce high peaks
on B-value curves and might bias the parameters for
those residues. This could be avoided by smoothing
the curve, but no real effect on predictability was
seen, e.g., in the case of the VTR method the unsmoothed
value with window size 9 was 0.3302
while it was 0.3304 for the smoothed data. The same
order of improvement was noticed also in the other
three methods. Many of the highest peaks were already
filtered away when the three N- and C-terminal
residues of each protein were not included in the
calculation of prediction parameters. This was done
because the ends are known to be exceptionally flexible.

TABLE I. Neighbor Correlated Flexibility Parameters From the 92 Protein Structures

              B norm           B norm0          B norm1          B norm2
Residue   Count   Value    Count   Value    Count   Value    Count   Value
W           264   0.904       51   1.186       60   0.938      153   0.796
C           333   0.906       68   1.196       79   0.939      186   0.785
F           708   0.915      159   1.247      154   0.934      395   0.774
I           926   0.927      208   1.241      213   0.977      505   0.776
Y           646   0.929      144   1.199      165   0.981      337   0.788
V          1297   0.931      296   1.235      325   0.968      676   0.781
L          1505   0.935      346   1.234      365   0.982      794   0.783
H           457   0.950      112   1.279      121   0.967      224   0.777
M           349   0.952       85   1.269       75   0.963      189   0.806
A          1499   0.984      432   1.315      338   0.994      729   0.783
T          1057   0.997      300   1.324      270   0.998      487   0.795
R           764   1.008      225   1.310      186   1.026      353   0.807
G          1529   1.031      491   1.382      359   1.018      679   0.784
Q           674   1.037      213   1.342      161   1.041      300   0.817
S          1171   1.046      378   1.381      279   1.025      514   0.811
N           794   1.048      255   1.380      221   1.022      318   0.799
P           857   1.049      295   1.342      201   1.050      361   0.809
D          1011   1.068      371   1.372      226   1.022      414   0.822
E          1027   1.094      396   1.376      253   1.052      378   0.826
K          1038   1.102      420   1.367      278   1.029      340   0.834

The use of backbone atoms was tested by comparing
predictability to parameters derived from Cα atoms
of the 92 proteins. Correlation coefficients were determined
for both parameter sets and the mean was
found to be 0.330 for the backbone derived data with
window 9 whereas it was 0.320 for the tables derived
from Cα atoms. The improvement in the backbone-derived
data is surprisingly small. The same result
can be noticed when comparing the results of backbone
parameters (VTR scale) to those of Cα parameters
(KS table). It seems that in KS analysis there
were enough data to bring the predictability with
this sort of technique close to its maximum and our
data for 17,906 amino acids did not change it much.

We tested the prediction methods also with structures
not included in our data set; 38 randomly selected
structures from a later version of PDB not
having significant sequence similarity to the proteins
used in the derivation of the parameters were
analyzed. Here, too, predictions with the VTR and
KS methods gave the best results (mean values
0.3359 and 0.3260, respectively); the R and BP scales
were clearly the worst ones (mean values 0.2460 and
0.2596, respectively). The accuracy of the predictions
is of the same order as for the structures used
to derive the parameters, but the differences are not
significant. The VTR is somewhat better than the
KS method. The most striking result is an increase
in the predictability of the R method. VTR and KS
parameters are the best and the new parameters are
somewhat more accurate.

TABLE II. Optimization of Flexibility Prediction Windows*

Window             Prediction technique
size         VTR       KS        BP        R
 5         0.3158    0.3112    0.2345    0.1659
 7         0.3266    0.3283    0.2428    0.1655
 9         0.3304    0.3356    0.2387    0.1644
11         0.3280    0.3332    0.2219    0.1602
13         0.3204    0.3235    0.2092    0.1628
15         0.3125    0.3142    0.2030    0.1645

*The overall correlation coefficients for all the chosen 92 proteins
were determined with different prediction window sizes.

The applicability of the flexibility predictions is
shown for myoglobin in Figure 1. The VTR and KS
plots resemble each other although the new scale
discriminates flexible and rigid regions more
sharply, which is advantageous in searching for antigenic
regions. The flexibilities of the two techniques
follow quite well the shape of the B-value
curves, although the predicted curves are smoother.

The flexibility predictions and experimental
B-values could further be compared with the program
MULTICOMP,27 a multiple sequence comparing
tool which can also be used for comparing predictions.
Prior to this kind of analysis the B-values
and flexibility propensities have to be normalized to
express the same range of values. This approach has
also been used to compare hydropathy predictions by
comparing two different methods of predicting
hydropathic character on the same protein.28

Prediction of Antigenic Sites 

The protein surface serves as a template for numerous
antibodies. Some of the epitopic regions are
formed by consecutive residues. These regions have
been determined for several proteins such as sperm
whale myoglobin16 (PDB entry 1mbo), hen egg
white lysozyme16 (1lyz), tobacco mosaic virus protein29
(2tmv), horse cytochrome c16 (sequence entry
ccho), bovine serum albumin30 (a36401), rotavirus
major outer-shell glycoprotein31 (vs09_rots1), and
hepatitis B virus core protein32 (nkvlah). The proteins
contained altogether 31 continuous epitopes
when the N- and C-terminal regions were omitted.
Since one of the major applications of the flexibility
predictions has been epitope search, all four prediction
techniques were used to locate antigenic regions
in the seven proteins.

Each prediction technique was run with the optimized
window sizes. Since there are no general rules
to locate the antigenic regions from plots, areas having
some sort of peak in the epitope region were
considered to match. The VTR, KS, and R parameters
predicted correctly 20 of the 31 epitopes, which
means 65% success ratio. The BP method was much
poorer, giving only 13 correct regions, 42% success.
These figures might be reasonably good for this sort
of simple method were there not also a high number
of false positives. In Figure 1 we have included also
the antigenic index,19 which is specially made for
epitope search. However, it was most often indicating
some 60% of the sequence as highly antigenic,
thus we did not consider that method at all.

Hydropathy profiles have generally been used for
searching epitopes. The method of Hopp and
Woods,33 perhaps the most often used prediction
technique for this purpose, was used to analyze the
same proteins. There were 21 correctly predicted
sites, indicating no difference in accuracy to flexibility
techniques. Because of the vague nature of the
flexibility we could not calculate the ratios of correctly
and wrongly predicted regions. Anyhow, it
could be noted that by far the best methods for
searching epitopes among the highest peaks in predictions
are VTR and KS. They also predicted fewer
false epitopes. The new parameters were better because
they separated the peaks more clearly, which
makes the interpretation of the results clearer and
more accurate. The hydropathy predictions were
made with program HYDRO.34 Note that the hydrophilic
regions are pointing down in the Hopp and
Woods33 prediction.

Flexibility Indices 

The functional properties of a molecule are a compromise
between flexibility and rigidity. The correlation
between averaged flexibilities and protein
thermal stability has been verified with flexibility
indices calculated from the normalized B-values of
KS.2 Here we used VTR parameters to calculate also
F indices. The values determined with KS parameters
are shown for comparison. The differences in KS
results to those previously published are due to a
minor error in the routine for calculating F7 in the
previous work. Several groups of enzymes studied
(Table III) indicated that the correlation to protein
stability was even clearer with the new parameters.
Indices calculated with VTR parameters show correlation
also in alanine dehydrogenases, glucoamylases,
serine proteases, and phosphoglycerate kinases,
but not with those of KS. The flexibility
indices are comparable for proteins having similar
function and folding. Because they do not take into
TABLE III. Flexibility Indices of Some Proteins*

Alanine dehydrogenase
  Bacillus sphaericus IFO3525
  Bacillus stearothermophilus IFO12550
α-Amylase
  Bacillus subtilis 168
  Streptomyces griseus IMRU 3570
  Bacillus amyloliquefaciens
  Aspergillus oryzae
  B. stearothermophilus ATCC 12980
  B. stearothermophilus NZ-3
  Bacillus licheniformis NCIB 8061
β-Amylase
  Bacillus circulans NCIB 11033
  Clostridium thermosulfurogenes ATCC 33743
β-Glucanase
  B. amyloliquefaciens
  Bacillus macerans
Cyclodextrin glycosyltransferase
  Klebsiella pneumoniae M5a1
  B. macerans IAM 1243
  Bacillus circulans ATCC 21783
Ferredoxin
  Clostridium acidi-urici
  Clostridium tartarivorum
  Clostridium thermosaccharolyticum
Glucoamylase
  Saccharomycopsis fibuligera HUT 7212
  Schwanniomyces occidentalis ATCC 26076
  Aspergillus awamori
Inorganic pyrophosphatase
  Saccharomyces cerevisiae
  Escherichia coli
Lactate dehydrogenase
  Lactobacillus casei DSM 20011
  Bacillus psychrosaccharolyticus DSM 6
  Bacillus megaterium
  B. subtilis XI
  Bacillus caldotenax YT-G
  Bacillus caldolyticus
  B. stearothermophilus NCIB 8924
  Thermus caldophilus GK 24
Neutral protease
  B. amyloliquefaciens
  B. subtilis
  B. cereus DSM 3101
  B. stearothermophilus CU21
  B. caldolyticus YT-P
  Bacillus thermoproteolyticus
  B. stearothermophilus MK232
Phosphoglycerate kinase
  S. cerevisiae
  Thermus thermophilus HB-8
Serine protease
  B. amyloliquefaciens
  Thermoactinomyces vulgaris
  Thermus aquaticus YT-1



Optimum 



Temperature 



Stability 



Ou r parameters KS parameters 
Jk_*L_ F 9 F 



42 
50-60 

80 



50%, t;5 0 C, 5 min 
50%, V&C, 5 min 

30%, r; 5 ° C , 10 min 



50 



60 
45 



70%, f 0°C, 30 min 

50%, £0°C, 2h 
100%, S0°C, 2h 

77%, 57°C, 1 h 
100%, 70°C, 1 h 

50%, 70°C, 4 min 
50%, 70°C, 9 min 

100%,45°C, 15 min( 
90%, 50°C, 15 min 
100%, 65°C, 30 min ( 

22%, 7TC, 2h 
53%, 70°C, 2h 
90%, 70°C, 2 h 



50 
52 
70 



20%, 50°C, 5 min 
35%, 9()°C, 5 min 

100%, 60°C, 5 min 
40/35 „ 

100%, 4tf°C, 30 min 
50-60 100%, 5S'C, 30 min 
60/70 100%, 6,'i°C, 30 min 
100%, 7U°C, 30 min 
55/60-70100%, 7.TC, 30 min 
100%, 9U°C ; 60 min 



75%, 6f °C, 20 min 
80%, 6/:°C, 30 min 
77 70%, 7( °C, 30 min 
30%, 9ll°C, 30 min 
45%, 90«C, 30 min 



Reference 



+ Ca) 
+ Ca) 



35 
35 



36, 37 



1.0041 1.0035 0.9897 0.9897 
0 !»958 0.9964 0.9903 0.9905 

1.0551 1.0548 1.0081 1.0080 

1.0223 1.0217 1.0002 0.9998 

1.0470 1.0484 1.0051 1.0054 

1-0125 1.0109 0.9978 0.9972 

1.0178 1.0187 0.9957 0.9963 

1.0186 1.0196 0.9967 0.9973 

U'295 1.0304 0.9971 0.9976 

1.0400 1.0411 1.0044 1.0050 
1.0048 1.0055 0.9936 0.9939 

1.(227 1.0190 0.9969 0.9954 
1.0106 1.0124 0.9956 0.9950 

1.C405 1.0395 i:0065 1.0058 
1.C226 L0221 1.0026 1.0026 
1.C202 1.0198 0.9979 0.9978 

0.9970 1.0048 0.9651 0.9702 
0.9622 0.9659 0.9659 0.9660 
0.9625 0.9663 0.9677 0.9681 



1.0475 1.0482 1.0068 1.0071 43 

1.0275 1.0279 0.9990 0.9991 44 

1.0173 1.0184 1.0067 1.0068 45,46 

1.0405 1.0420 1.0044 1.0038 

U)273 1.0245 0.9921 0.9912 



38 
39 

40 
40 



41,42 



1.0153 
1.0144 
1.0130 
1.0099 
1.0110 
1.0037 
1.0021 
0.9! )92 

1.0-107 
1.0;i84 
1.0:»92 
1.01)32 
1.01H5 

i.o;:60 

1.0260 



1.0143 
1.0107 
1.0117 
1.0108 
1.0090 
1.0020 
1.0003 
1:0015 

1.0449 
1.0418 
1.0295 
1.0252 
1.0234 
1.0273 
1.0272 



0.9923 0.9917 
0,9904 0.9890 
0.9906 0^9898 
0.9897 0.9898 
0.9870 0.9865 
0.9805 0.9794 
0,9796 0.9786 
0.9816 0.9821 

1.0128 1.0134 
1.0143 1.0154 
1.0045 1.0042 
0.9957 0.9965 
0.9958 0.9961 
0.9995 0.9991 
0.9998 0.9994 



55 1.0*38 1.0316 1.0004 0.9997 
>90 ; 1.0S14 1.0330 0.9898 0.9898 



47, 48 
48, 49 



50,51 
52 

53 
54 
55, 56 
57,58 
59 
58, 60 
61,62 

4, 63 
4, 64 



*If reference is not given it is mentioned in Vihinen



50-75 
60-85 
80 



50%, 55'C, 40 min 
85%, 80'C, 3 h 



1.0c 40 1.0353 1.0041 1.0049 65 66 
1.0571 1.0271 1.0074 1.0076 67,' 68 
_UK02 1,0197 1.0Q50 1.0041 fiq[ 7n 
Fig. 1. Flexibility predictions, antigenic index, and experimental B-values of backbone atoms in sperm
whale myoglobin (1mbo). The flexibility predictions were obtained with optimized prediction windows.
Continuous epitopes are indicated with open boxes.



account all the stabilizing forces some discrepancy
has been noticed.2 This can still be seen in the case
of the most stable α-amylases, ferredoxins, and neutral
proteases. Ferredoxins with extra stabilizing
ionic bonds have been discussed previously.2 In neutral
proteases the higher stability is presumably
gained by the extra Ca2+ binding site. F9 indices
were determined also with KS parameters, but the
results are not shown here because the differences to
F7 values were not higher than 0.0005, usually
much less.

The flexibility indices calculated with the VTR
parameters had more pronounced correlation to stability
data than the KS parameters. Our values
show correlation even when those of KS fail. This is
presumably due to two reasons. Our structural database
is larger. We also used the backbone information
instead of Cα atoms. All the atoms in residues
were not used because flexibility of side chains
does not mean that the backbone is also flexible and
the flexibility of the protein backbone is typical for
surface regions and epitopes. Side chains can be
rather mobile although the backbone is rigid. Because
one of the major applications of the method
will be to search for epitopes and mobile exposed
regions, only the backbone data were used. Another
reason was that often the data for side chains are
missing or are poorly determined.

CONCLUSIONS 

New parameters were determined for prediction of
protein flexibility. The applicability was studied by
comparing atomic temperature factors of crystallographically
determined proteins to predictions. The
VTR and KS parameters were clearly the best. We
would suggest the use of a prediction window of 9
residues and VTR parameters because they gave
slightly better correlation on a test set of 38 proteins,
because they separate flexible regions more
clearly on plots, and because they gave much better
correlation when used in flexibility indices. It seems
that the accuracy of the sliding window technique is
approaching its limit and it might be difficult to improve
it significantly. The same sort of limits have
also been met in secondary structural predictions
where the average accuracy has been for a decade
about the same in spite of numerous new methods.71,72
These limits in prediction techniques are
presumably due to intrinsic limitations of the statistical
methods, which cannot take into account all
the different features of complicated protein structures.
Somewhat improved predictions might be obtained
with neural nets and other knowledge based
systems.

Flexibility profiles can be useful in several ways.
When joined with sequence analysis and structural
predictions, they can add to our understanding of
proteins. In addition to being used for epitope
searches, flexibility calculations can be applied in
studies concerning sequence and structural similarity,
molecular modeling, and protein engineering.

We have developed a hybrid system to predict the secondary structures
(α-helix, β-sheet and coil) of proteins and achieved 66.4% accuracy, with
correlation coefficients of C_coil = 0.429, C_α = 0.470 and C_β = 0.387.
This system contains three subsystems ("experts"): a neural network
module, a statistical module and a memory-based reasoning module. First,
the three experts independently learn the mapping between amino acid
sequences and secondary structures from the known protein structures,
then a Combiner learns to combine automatically the outputs of the
experts to make final predictions. The hybrid system was tested with 107
protein structures through k-way cross-validation. Its performance was
better than each expert and all previously reported methods with greater
than 0.99 statistical significance. It was observed that for 20% of the
residues, all three experts produced the same but wrong predictions. This
may suggest an upper bound on the accuracy of secondary structure
predictions based on local information from the currently available
protein structures, and indicate places where non-local interactions may
play a dominant role in conformation. For 64% of the residues, at least
two experts were the same and correct, which shows that the Combiner
performed better than majority vote. For 77% of the residues, at least
one expert was correct, thus there may still be room for improvement in
this hybrid approach. Rigorous evaluation procedures were used in testing
the hybrid system, and statistical significance measures were developed
in analyzing the differences among different methods. When measured in
terms of the number of secondary structures (rather than the number of
residues) that were predicted correctly, the prediction produced by the
hybrid system was also better than those of individual experts.

Keywords: protein secondary structure prediction; hybrid system; neural networks;
memory-based reasoning; statistical methods



1. Introduction 

Determining the mapping between amino acid sequences and
secondary structures (α helix, β sheet, etc.) is an important step
towards our understanding of how amino acid sequences specify their
overall structures and functions. Currently the main [...] protein
structures [...] X-ray crystallography, which is a slow and often
difficult process. On the other hand, the database of known protein
sequences is growing very rapidly. Thus, it is increasingly important
to develop computational approaches to determine automatically
(predict) the structures of proteins whose sequences are known. The
correct prediction of secondary structures can [...] toward this goal.
For example, the knowledge of secondary structures can [...] molecular
[...] models (Skolnick & Kolinski, 1990), or can be used in predicting
higher order structures (e.g. super secondary structures (Taylor &
Thornton, 1984) and domains ([...] et al., 1987)).

Many algorithms have been developed for protein secondary structure
prediction. One of the [...] efforts was made by Chou & Fasman (1974).
Different implementations of their algorithm have all attained about a
50 to 60% level of accuracy in predicting the location of α helix,
β sheet and "coil" (i.e. anything other than helix or sheet) in a
protein sequence. [...] algorithm (Garnier et al., 1978) [...] accurate
for this task. More recently the improved algorithm (Gibrat et al.,
1987) [...] (Qian & Sejnowski, 1988) [...] artificial neural network
algorithm to increase the prediction [...] achieved by other
researchers (e.g. Kneller et al., 1990; Holley & Karplus, 1989). Thus
there has been about a 10% improvement of prediction accuracy in

0022-2836/92/121049-15 $03.00/0        © 1992 Academic Press Limited

Figure 1. A window is moved along an amino acid
sequence to extract correlations between the residues and
the secondary structure state of the center residue.



the last 15 to 20 years, which is due to both the
improved computational methods and the increase
of the known protein structure data. Almost all
these algorithms have adopted a "local strategy":
moving a "window" (typically covering 7 to 19
residues) along an amino acid sequence and predicting
the secondary structure state of the center residue
in the window according to all the residues inside
the window (see Fig. 1). To assess the accuracy of a
prediction algorithm for proteins whose structures
are not known, it is a common practice to divide the
known protein structure database into two separate
sets: the "training data set" is used to set the
parameters of the algorithm, and the "test data set"
is used to test its prediction accuracy. The predictions
produced by the existing algorithms, though
imperfect, can often show the likelihood or tendency
of certain peptide chains to form particular
secondary structures. It is also important to know
the extent to which the protein structures are
determined by "local interactions": interactions among
residues adjacent along the polypeptide chain.
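
The window extraction of the local strategy can be sketched as follows.
This is a minimal illustration, not the authors' code: the function name
`windows`, the one-letter state strings and the padding symbol `-` for
positions beyond the chain ends are all assumptions, since the text does
not say how chain ends were handled.

```python
from typing import Iterator, Tuple

PAD = "-"  # hypothetical padding symbol for positions beyond the chain ends


def windows(sequence: str, states: str, size: int = 13) -> Iterator[Tuple[str, str]]:
    """Yield (window, secondary structure state of the center residue)."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that it has a center residue")
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    for i, state in enumerate(states):
        # padded[i:i + size] is centered on residue i of the original sequence
        yield padded[i:i + size], state


# A toy 5-residue chain with a window of 3 residues.
examples = list(windows("MKVLA", "CHHHC", size=3))
```

Each training or test example is then one window paired with the state of
its center residue, which is the form all three experts consume.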

Though existing prediction algorithms are all
about 60 to 64% accurate for three-state (α-helix,
β-sheet, and coil) prediction, they can make incorrect
predictions at different places of an amino acid
sequence. From the point of view of machine
learning (artificial intelligence), secondary structure
prediction is an instance of inductive learning,
generalising from known examples to solve new
problems. Different algorithms may work according
to different principles and can generalize in different
ways. Therefore, a combination of different algorithms
can potentially produce a better prediction
than individual ones. Based on this analysis, we
developed a hybrid system to predict the secondary
structures, which indeed improved the prediction
accuracy significantly. Our hybrid system has three
different modules ("experts"): a neural network
module, a statistical module and a memory-based
reasoning module, and a Combiner. The experts
were chosen in such a way that they have different
mathematical properties. In the training phase, the
experts independently learn the mapping between
amino acid sequences and secondary structures from
the known protein structures; the Combiner learns
to combine automatically the outputs of the
experts. In the prediction phase, the three experts
make predictions separately, then the Combiner
takes the predictions from the three experts and
makes final predictions. K-way cross-validation was
used in evaluating the hybrid system and statistical
significance measures were used in comparing
different prediction algorithms.

Our experiments showed that (1) the hybrid
system had an overall prediction accuracy of
66.4%, which was higher than individual experts
and all previously reported algorithms at greater
than 0.99 confidence level; (2) the three experts not
only had very close overall prediction accuracy,
their detailed predictions also agreed with one
another much more than with the real structure (i.e.
their prediction accuracy); (3) the accuracy of
prediction algorithms could change as the test data
changes, especially when the test data set was small
(e.g. containing 15 protein sequences); (4) for 20%
of the residues, all three very different experts
produced the same but wrong prediction, suggesting
that with the currently available protein structure
data, 80% may be the upper bound for the
secondary structure prediction accuracy using the
local strategy; (5) compared to each expert, the
hybrid system also produced better results in terms
of the number of secondary structures (rather than
the number of residues) that were predicted
correctly.



2. Methods and Materials 

(a) The architecture and training of a hybrid system

Figure 2 shows the overall architecture of our hybrid
system. The system contains three "experts", a statistical
module, a memory-based reasoning module and a neural
network module, and a Combiner. The whole system
produces secondary structure predictions as follows: given
a set of amino acid sequences (i.e. test data), each expert
makes its predictions independently, then the Combiner
takes the predictions from the 3 experts and combines
them to produce final predictions. At the beginning, the
hybrid system learns from the training data set about
mappings between amino acid sequences and secondary
structures. The training of the whole system involves
(1) training the 3 experts and (2) training the Combiner.
How each expert is trained and how each makes predictions
are discussed in the following sections. In order to
train the Combiner, half of the training data is used to
train the 3 experts separately, and the outputs of these
trained experts on the second half of the training data are
recorded. These outputs are then used as inputs to train
the Combiner. The reason for dividing the training data
set into 2 parts is that the behavior of each expert on
training data can be very different from its behavior on
the proteins whose structures are unknown; their
performance on the data that they are not trained on (the
second half of the training set) reflects their behaviors on
truly unknown protein structures, which is exactly what
the Combiner should know about and be trained on. The
training of the experts with half of the training data is
done purely for the purpose of training the Combiner.
After the training of the Combiner is completed, each

Figure 2. The hybrid system has 3 experts, a statistical
module, a memory-based reasoning module and a neural
network module. The Combiner combines the outputs of
the 3 experts to produce a final output.



expert is trained again with the whole training data set. 
These trained experts together with the trained Combiner 
form a trained hybrid system. 
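
The two-halves training protocol described above can be sketched as
follows. This is only an illustration under stated assumptions: the class
`FrequencyExpert` is a hypothetical toy stand-in used both for the experts
and for the Combiner (the real experts and Combiner are far richer), and
the function names `train_hybrid` and `predict_hybrid` are not from the
paper.

```python
from collections import Counter, defaultdict


class FrequencyExpert:
    """Toy stand-in: predicts the commonest state seen for each input."""

    def fit(self, xs, ys):
        table = defaultdict(Counter)
        for x, y in zip(xs, ys):
            table[x][y] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in table.items()}
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, xs):
        return [self.table.get(x, self.default) for x in xs]


def train_hybrid(make_experts, make_combiner, xs, ys):
    """Train the experts and the Combiner with the two-halves protocol."""
    half = len(xs) // 2
    # Step 1: temporary experts see only the first half of the training data.
    temp = [e.fit(xs[:half], ys[:half]) for e in make_experts()]
    # Step 2: their outputs on the unseen second half are the Combiner inputs.
    stacked = list(zip(*(e.predict(xs[half:]) for e in temp)))
    combiner = make_combiner().fit(stacked, ys[half:])
    # Step 3: the experts are trained again on the whole training set.
    experts = [e.fit(xs, ys) for e in make_experts()]
    return experts, combiner


def predict_hybrid(experts, combiner, xs):
    """Experts predict separately; the Combiner makes the final prediction."""
    stacked = list(zip(*(e.predict(xs) for e in experts)))
    return combiner.predict(stacked)
```

The point of step 2 is that the Combiner only ever sees expert outputs on
data the experts were not trained on, which mirrors how the experts will
behave on truly unknown structures.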

(b) Memory-based reasoning

Memory-based reasoning (MBR†) (Stanfill & Waltz,
1986) is one expert in our hybrid system. The essential
idea of MBR is to use known examples directly in problem
solving. For predicting the protein secondary structures,
this involves matching each segment (window) of amino
acid sequences in the test data set against all the
sequences in the training set, finding its "nearest
neighbors", and choosing the secondary structure state of the
majority of its neighbors as the prediction. Similar
approaches have been referred to as the "nearest neighbor
method", "exemplar-based reasoning", etc. Levin et al.
(1986) and Nishikawa & Ooi (1986) called this approach
the "homologous method". The key component in this
approach is the distance function or metric used to
compute the neighbors. The choice of a metric is especially
difficult for elements such as amino acids, because
there is no linear ordering among the elements, which are
often referred to as having "nominal values". Stanfill &
Waltz (1986) proposed several distance functions for
nominal values in their work on memory-based reasoning.
We improved their functions and applied them to protein
secondary structures in this work.

Based on the idea of MBR, one distance matrix is
computed for each position of the window using the
training data set. At window position i, the distance
matrix $D_i$ contains the distance between every pair of
amino acids at that position. The distance between 2
segments of amino acid sequences $A = a_1 a_2 \ldots a_n$ and
$B = b_1 b_2 \ldots b_n$ is defined as:

$$D(A, B) = \sum_{i=1}^{n} D_i(a_i, b_i)$$

where n is the window size, $D_i(a_i, b_i)$ is the distance
between amino acids $a_i$ and $b_i$ at position i. The smaller

† Abbreviations used: MBR, memory-based reasoning;
IP, input pattern; SM, statistical module.



this distance is, the more similar $a_i$ and $b_i$ are in terms of
forming secondary structures, and the less effect it has on
secondary structures if one is replaced by the other. The
distance matrices $D_i$ can be computed from the training
data. Assuming there are $m$ secondary structure states
$s_1, s_2, \ldots, s_m$ and $q$ different amino acids
$x_1, x_2, \ldots, x_q$ ($a_i, b_i \in \{x_1, \ldots, x_q\}$),
$D_i(a_i, b_i)$ is computed as:

[Equation (1) is illegible in the source.]        (1)

where $x_h^k$ denotes amino acid $x_h$ at window position $k$;
$p(s_j|a_i)$ is the conditional probability of secondary
structure state $s_j$ given that $a_i$ has occurred; it represents
the influence on secondary structure $s_j$ by the singleton amino
acid at position i. $p(s_j|a_i, x_h^k)$ is the conditional probability
of $s_j$ given both $a_i$ and $x_h^k$ have occurred; it represents the
influence on $s_j$ by $a_i$ together with its neighbor amino
acids. Thus when $p(s_j|a_i) \approx p(s_j|b_i)$ and
$p(s_j|a_i, x_h^k) \approx p(s_j|b_i, x_h^k)$, $a_i$ and $b_i$ are
similar in determining secondary structures, and $D_i(a_i, b_i)$
should be small, which is exactly what equation (1) yields.
(c) A statistical method

A statistical module (SM) is the second expert in our
hybrid system. It works as follows: for each secondary
structure state $s_j$, if the conditional probability of $s_j$ given
a window of n residues $a_1 \ldots a_n$, $p(s_j|a_1 \ldots a_n)$, is known,
then the $s_j$ that has the highest value for this conditional
probability is chosen as the prediction for $a_1 \ldots a_n$:

$$\text{Prediction} = \arg\max_{s_j} p(s_j | a_1 \ldots a_n), \qquad s_j \in \{\alpha\text{-helix}, \beta\text{-sheet}, \text{coil}\}.$$

According to Bayes' Theorem:

$$p(s_j | a_1 \ldots a_n) = \frac{p(s_j)\, p(a_1 \ldots a_n | s_j)}{p(a_1 \ldots a_n)} \qquad (2)$$

where $p(s_j)$ is the probability of $s_j$ and $p(a_1 \ldots a_n|s_j)$ is the
probability of $a_1 \ldots a_n$ in secondary structure state $s_j$;
$p(a_1 \ldots a_n)$ is the probability of $a_1 \ldots a_n$ in all states.
Since we only want to find the largest $p(s_j|a_1 \ldots a_n)$,
$p(a_1 \ldots a_n)$ need not be computed. Currently there is not
enough protein structure data available for us to compute
the frequencies of $a_1 \ldots a_n$ in each state $s_j$ in order to
estimate $p(a_1 \ldots a_n|s_j)$. They have to be estimated by
some simpler terms. We extend and apply the
Bahadur-Lazarsfeld expansion (Bahadur, 1961) here
(which only deals with binary variables in its original
form). Assuming that $y_1, y_2, \ldots, y_n$ are random variables
with nominal values, then

$$p(y_1 \ldots y_n) = \prod_{i=1}^{n} p(y_i) \left\{ 1 + \sum_{i<k} Z_{ik} + \sum_{i<k<h} Z_{ikh} + \cdots \right\} \qquad (3)$$

where $Z_{ik}$ is the second order correlation between $y_i$ and $y_k$:

$$Z_{ik} = \frac{p(y_i, y_k)}{p(y_i)\,p(y_k)} - 1$$

and $Z_{ikh}$ is the third order correlation among $y_i$, $y_k$ and $y_h$:

$$Z_{ikh} = \frac{p(y_i, y_k, y_h)}{p(y_i)\,p(y_k)\,p(y_h)} - 1 - (Z_{ik} + Z_{ih} + Z_{kh})$$

and so on.

In practice, for the secondary structure prediction
problem, we can only estimate up to the second order
correlations with the currently available protein structure
data. The reliability of these estimates depends on the
sample size used. Thus, we postulate the following
equation:

[Equation (4) is illegible in the source.]        (4)

where $f_{ik}$ is proportional to the size of the sample in which
the ratio

$$\frac{p(a_i, a_k | s_j)}{p(a_i|s_j)\,p(a_k|s_j)}$$

is computed, to represent its reliability:

[The definition of $f_{ik}$ is illegible in the source.]

Some observations about equation (4): (1) Compared
with equation (3), correlations among 3 or more residues
are ignored. This is due to the limited sample size. This
truncation may have an overall positive or negative effect
on the contribution from higher-order correlations in the
approximation, thus coefficient $C_f$ is introduced to
compensate for this. $C_f$ can be experimentally determined.
(2) When there are no higher-order correlations
among the residues in a window (i.e. they are all
independent), $p(a_1 \ldots a_n|s_j)$ is reduced to $\prod_i p(a_i|s_j)$,
which is correct. (3) Information of all $\binom{n}{2}$ possible pairs
of residues in a window of size n is used here, whereas in a
commonly used statistical method, the GOR III method, only
n - 1 pairs are used. (4) If the pairwise correlation terms are
small and the approximation $\log(1 + x) \approx x$ is used, we get
the following equation:

[Equation (5) is illegible in the source.]        (5)

This is exactly the form in Lazarsfeld's original expansion
(Lazarsfeld, 1961), which he derived from a completely
different path. One advantage of equation (5) is that it
guarantees that the probability approximation is
non-negative, which equation (4) does not do. Equation (5) is
the final form of the statistical expert used in this work.

(d) Artificial neural network

Artificial neural networks have been used widely in
many applications (McClelland & Rumelhart, 1986),
including protein secondary structure prediction (Qian &
Sejnowski, 1988; Kneller et al., 1990). An artificial neural
network usually consists of a large number of simple
processing units connected by weighted links. Each unit
computes its output by applying an "activation function"
to its inputs. The training algorithm used in this work, the
Back-propagation algorithm (Rumelhart et al., 1986),
works on a particular kind of artificial neural network, a
layered, feed-forward network (see Fig. 3), where the
processing units are arranged in layers: there is an input
layer, an output layer, and one or more "hidden layers"
(layers between the input and output layer). A feed-forward
network computes its output in the following
fashion: first, the input layer is set according to an input
pattern; then one layer at a time, from the input to
hidden to output layer, the units compute their outputs
by applying an activation function to the weighted sum of
the outputs from the units at the lower layer. The weights
come from the links between the units. The "sigmoid
function" is often used in feedforward networks as the
unit's activation function:

$$O_{ij} = \frac{1}{1 + e^{-x}}$$

where $O_{ij}$ is the output of unit j at layer i, and x is the
weighted sum of outputs from units at one layer below:

$$x = \sum_{k} w^{i-1}_{kj}\, O_{(i-1)k}$$

$w^{i-1}_{kj}$ is the weight of the link from unit k at layer i-1 to
unit j at layer i. This can also be seen as a projection of
the network input to a certain direction specified by the
weights. Thus, each hidden unit represents a different
projection from the multiple dimensional input space to a
new space whose dimensionality is determined by the
number of hidden units in the network.

Figure 3. A one-hidden-layer feedforward artificial
neural network. The network computes its output based
on the values of the units at the input layer.

The Back-propagation algorithm "trains" a layered
network by adjusting the link weights of the net using a
set of "training examples". Each training example
consists of an input pattern and an ideal output pattern
that the user wants the network to produce for that input.
The weights are adjusted based on the difference between
the ideal output and the actual output of the network.
This can be seen as a gradient descent process in the
weight space. An "epoch training cycle" consists of
presenting all training examples once to the network, and
then adjusting the weights on the basis of the accumulated
errors at the output layer. A number of epoch cycles
may be required before the output errors are reduced to
an accepted level. After the training is completed, the
network can be applied to inputs that are not in the set of
training examples. For a new input pattern IP, the
trained network tends to produce an output similar to the
training example whose input is similar to IP. This can be
used for interpolation, approximation, or generalization
from examples depending on the goal of the user.
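
The forward computation of a layered feed-forward network with sigmoid
units can be sketched as follows. This is a minimal illustration of the
layer-by-layer rule described in this section; the nested-list weight
layout and the function names are assumptions, and no bias terms are
included because the text does not mention any.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers: List[List[List[float]]], inputs: List[float]) -> List[float]:
    """Propagate an input pattern through the network.

    layers[i][j][k] is the weight of the link from unit k in the layer
    below to unit j in layer i (hypothetical layout).
    """
    outputs = inputs
    for weights in layers:
        # Each unit applies the sigmoid to the weighted sum of the
        # outputs of the units in the layer below.
        outputs = [
            sigmoid(sum(w * o for w, o in zip(row, outputs)))
            for row in weights
        ]
    return outputs


# Two inputs, two hidden units, one output unit, all weights zero.
zero_net = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]]]
```

With all weights at zero every weighted sum is 0, so every unit outputs
sigmoid(0) = 0.5; training would then move the weights away from this
uninformative starting point.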
Table 1

Protein structures used in this work 



Protein 



Cytochrome c550 

Cytochrome B5G2 {E.coli, oxidized) - 
1-Arabinose-binding protein 
Actinoxanthin 
Phospholipase A2 
Cytochrome c5 (oxidized) 
Cytochrome c 

Calcium- binding parvalbumii B 
Crambin 

Subtilisin carlsberg (inhibitor) 
17/112 Ribosomal protein (C- terminal domain) 
Cytochrome c3 

Hemoglobin (erythrwruorin, deoxy) 
Elongation factor tu (domain i) 

Immunoglobin FAB 

FC fragment (1G01 class) 

Immunoglobin fc and fragment 15 of protein a 
complex 

Ferrodoxin 

Flavodoxin 
' Ferredoxin 

Glucagon (pH G-pH 7 form) 

y Crystallin 

Glyceraldehyde-3-phosphate dehydrogenase 
Glutathione peroxidase 

Oxidized high potential iron protein (HIP IP) 
Hemerythrin (MET) 
1 insulin 

Leghemoglobin (Acetate, M 3T) 
Lysozyme 

Myoglobin (DEOXY, pH 8-1) 
Immunoglobulin FAB fragment (MC/PC603) 
Melittin 
Neurotoxin B 
Pseudoazurin 
Plastoeyanin 

Hydroxvbenzoate hydroxylase 
Caleium-free phospholipase A2 
Avian pancreatic polypeptide 
Rhodanese 
Bibonuclease A 
Ribonudease Tl isozyme 
Subtilisin BPN 
Trypsin (RGT) 

Scorpion neurotoxin (variant 3) 
Tripsinogen complex with porcine pancreatic 

secretory 
Ttiose phosphate isomerast 
Tonin 
Ubiquitin ' 
a-Bungarotoxin 

Actinidin (sulfhydryl proteinase) 
' Acid proteinase, penicilloprpsin 
Acid proteinase (rhizopuspspsin) 
Azurin (oxidized) 
Cytochrome 155 (oxidized) 
Carbonic anhydrase form B (carbonate 

dehydratase) 
Cytochrome c! 
Cytochrome c3 
Chymotrypsinogen A 
Chy mo trypsin inhibitor 2 CI -2) 
Concanavalin A 
Cytochrome P450CAM (camphor monooxygenase) 2CPP 
Citrate synthase 

Cytochrome c peroxidase 20 Y I 

Gene 5 DNA binding protein 2CN5 
Hemoglobin (deoxy) 2 ! t J?» 
Hemoglobin V (CYANO.IV ET) 21 HB 

Lysozyme ^™ 
Cytoplasmic malate dehydrogenase 2MDH 



Codr 


Subunit 


length 


N'o. H 


No. E 


155C 




. 134 


35 


5 
0 


I SOB 




no 


67 


1ABV 




300 


100 


20 


1 ACX 




107 


0 


47 


1 BP2 




123 


54 


8 


.ICO 




83 


30 


0 


I ecu 




111 


44 


0 


1CPV 




108 


52 


0 


1CR\ T 




46 


20 


4 


icsk 




03 


1 1 
1 1 


22 


icnv 




08 


35 


18 


ICY 3 




118 


10 


0 


1ECD 




130 


97 


0 


1ETU 




190 


78 


30 


1FB1 


(H L) 


445 


11 


208 


1 FC i 


(A) 


200 


15 


05 


1 FC2 


(C) 


43 


21 


0 


1FDX 




54 


5 


4 


1 1 




147 . 


43 


32 


1 FX B 




81 ' 


10 


0 


1GCM 




20 


14 


0 


1GCR 




174 


5 


77 


1G1M 


(0) 


330 


73 


95 


1GP1 


(A) 


184 


43 


20 


1HIP 




85 


10 


0 


IHMQ 


(A) 


113 


73 


0 


11 NS 


(A D) 


51 


22 


3 


ILK I 


153 


107 


0 


1LZI 




130 


39 


10 


1MBD 




153 


113 


0 


1 M( ;P 


(H L) 


442 


8 


211 


1 Ml ^T 


(A) 


20 


22 


0 


1NXB 


02 


0 


20 


I PAZ 




120 


17 


44 


IPCY 




00 


4 


35 


IPHH 




304 


110 


00 


1 PJ>2 


(L) 


133 


48 


8 


I PIT 


30 


18 


0 


1RHD 




203 


81 


32 


1RN3 




124 


22 


48 


1RNT 




104 


17 


28 


1SBT 




275 


83 


49 


1SGT 




240 


21 


77 


1SN3 




05 


8 


12 



1TGS 

ITIM 

1TON 

1TJBQ 

2ABX 

2ACT 

2APP 

2 A PR 

2AZA 

2B5C 

2CAB 

2CGY 

2CDV 

2CGA 

2CI2 

2CNA 



(I) 
(A) 



(A) 
(B) 

(A) 
(A) 



(A B) 
(A B) 



57 
248 
238 

70 

74 
218 
323 
325 

129 

85 

256 
127 
107 
245 
65 
237 
405 
437 
293 
87 
287 
149 
104 
049 



9 . 
100 
10 
12 
0 

50 
30 
20 
13 
21 

17 

90 

27 
18 
11 
4 

180 
257 
134 
0 
197 
100 
109 
213 



11 

42 
71 
24 

4 
40 
147 
140 

41 

21 

"9. 
' 0 
10 
79 
14 
103 
41 
0 
10 
4 
0 
0 
14 
110 



105 
Table 1 (continued)



Protein 



CD. ZN Metallothionein (isoform II) 
Ovomucoid third domain 
Prealbumin (human plasma) 
Proteinase K 

staphylococcal nuclease complex 
rt T .ZX Superoxide dismutnse 
Strcpfomtfcex subtilisin inhibitor 
Satellite tobacco necrosis virus 
Tomato bushy stunt virus 
Cytochrome cool (oxidized) 
Adenylate kinase 
Bacteriochlprophyll 
Cytochrome c2 (reduced) 
Native elastase 
Ferredoxin 

Catabolitc gene activator protein-cyclic AMP 
complex 

Glutathione reductase, oxidized form (E) 

( a Icium -binding protein 

Pjmsi.hoglycerate kinase complex with ATP 

hosphoglycerate mutase DE-phospho enzyme 
Kat mast cell pretense II 
Kubredoxin 

Wheat germ agglutinin (isolectin 2) 

1KP aporepressor 

APO-liver alcohol dehydrogenase 

Aspartate carbamoyl transferase 

Carboxypeptidase Ay. (COX) complex 

iJinydrofolate reductase complex 

Ferredoxin 

Flavodoxin (semiquinone form) 

Mu tate dehydrogenase APO enzyme M4 

I rypsin inhibitor 

fi Tryp S i n diisopropyiphosphorvl inhibited 
Southern bean mosaic virus coat protein 
1 hermolysin complex 
Troponin 0 

Carboxypeptidase Aa (COX) 
Cat a hi se 

Papain CVS-25 oxidized 
Total 



Code 



Subunit 



2MT2 
2OV0 

2 PAH 
2PRK 
2SNS 
2SOD 
2SSI 
2STV 
2THV 
35 1C 
3ADK 
3UCL 
3C2C 

3 EST 
3FXC 

30AP 

30 HS 

3ICB 

3PGK 

3 POM 

3KP2 

3IIXN 

3WGA 

3WRP 

4ADH 

4ATC 

4CPA 

4DFU 

4KD1 

4FXN 

4LDH 

4PTI 

4PTP 

4SBV 

4TI.N 

4TNC 

5CPA 

70AT 

9PAP 



(A) 
(B) 
(C) 



(A) 
(A) 



(A B) 

(I) 
(B) 



(C) 

(A) 
113 



IF there is more than 1 subunit in a protein 
length md.cates the numberofresidi.es in th< 
residues in a-helix: Xo. E indicates the numb,, 
Table, with 113 subimits. 19.861 residues 



l/tmgth 


No. H 


Xo. E 


01 


0 


o 


50 


10 


9 


114 


8 


59 


270 


60 


(it) 


141 


20 


28 




0 


54 


107 


17 


26 


184 


18 


82 ' 


321 


4 


112 


82 


38 


0 


194 


100 


25 


350 


57 


170 


112 


44 


0 


251 


13 


82 


98 


7 


15 


208 


64 


21 


461 


132 


111 


75 


43 




415 


143 


40 


230 


09 


15 


237 


12 


83 


52 


(> 


8 


171 


10 


10 


101 


77 


0 


374 


79 




403 


133 


05 


37 


0 


(3 
50 


150 


29 


100 


18 


14 


138 


47 


29 


o»>»> 


lit 
1 1 1 


37 


58 


8 


14 


234 


16 


72 


222 


32 


72 


310 


117 


54 


160 


101 


0 


307 


111 


50 


' 498 


137 


71 


212 


49 


30 


19,801 


5324 


4098 



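If there is more than 1 subunit in a protein, column Subunit indicates
which subunit(s) was used. Length indicates the number of residues in
the protein sequences used. No. H indicates the number of residues in
α-helix; No. E indicates the number of residues in β-sheet. There are
107 proteins in this Table, with 113 subunits, 19,861 residues.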



(e) Database

A database of 107 proteins was selected from the
Brookhaven Protein Data Bank. It contains 19,861
residues, 113 subunits. All sequences (subunits) are less
than 50% homologous with one another. The DSSP
program (Kabsch & Sander, 1983a) was used to assign the
secondary structure state of each residue. The DSSP
program assigned 7 states, B, E, G, H, S, T and "the rest",
to the residues in our database. For the purpose of this
work, H was considered α helix, E was considered β sheet
and the rest were considered coil. Table 1 lists names of all
the proteins in our database.
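
The reduction from DSSP states to the three prediction states can be
sketched as follows. The function name `reduce_dssp` and the output
letters H, E and C are assumptions chosen for illustration; only the
mapping itself (H to helix, E to sheet, everything else to coil) comes
from the text.

```python
def reduce_dssp(dssp: str) -> str:
    """Map a string of DSSP state codes to three states: H, E or C (coil)."""
    mapping = {"H": "H", "E": "E"}
    # Every code other than H and E, including blank ("the rest"), is coil.
    return "".join(mapping.get(code, "C") for code in dssp)


three_state = reduce_dssp("HHGTE SB")
```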

(f) Prediction accuracy measurements

In this work, we adopted the commonly used definition
of prediction accuracy, which is the percentage of
correctly predicted residues for the 3 types of secondary
structures:

$$Q_3 = \frac{\sum_s q_s}{N} \times 100\%$$

where N is the total number of residues in the test data
sets, $q_s$ is the number of residues of secondary structure
type s that are predicted correctly, $s \in \{\alpha\text{-helix}, \beta\text{-sheet}, \text{coil}\}$.
To measure the "quality" of the prediction on each
type of secondary structure, Matthews' correlation
coefficient was also used. For secondary structure type s,

$$C_s = \frac{(p_s \cdot n_s) - (u_s \cdot o_s)}{\sqrt{(n_s + u_s)(n_s + o_s)(p_s + u_s)(p_s + o_s)}}$$

where $p_s$ is the number of positive cases that were
correctly predicted; $n_s$ is the number of negative cases
that were correctly rejected; $o_s$ is the number of
overpredicted cases, and $u_s$ is the number of underpredicted
cases. The coefficients thus measure the differences of
predictions for different types of structures.
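
Both accuracy measures can be computed from a pair of state strings, as in
the sketch below. The function names and the one-letter state encoding are
assumptions; returning 0.0 when the denominator vanishes is a convention
adopted here, not something the text specifies.

```python
import math


def q3(true: str, pred: str) -> float:
    """Percentage of residues whose state is predicted correctly."""
    correct = sum(t == p for t, p in zip(true, pred))
    return 100.0 * correct / len(true)


def matthews(true: str, pred: str, state: str) -> float:
    """Matthews' correlation coefficient for one secondary structure type."""
    pairs = list(zip(true, pred))
    p = sum(t == state and q == state for t, q in pairs)  # correctly predicted
    n = sum(t != state and q != state for t, q in pairs)  # correctly rejected
    o = sum(t != state and q == state for t, q in pairs)  # overpredicted
    u = sum(t == state and q != state for t, q in pairs)  # underpredicted
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (p * n - u * o) / denom if denom else 0.0


accuracy = q3("HHHHCCCC", "HHHCCCCC")
helix_c = matthews("HHHHCCCC", "HHHCCCCC", "H")
```

In the toy case one helix residue out of eight is missed, so the accuracy
is 87.5% while the helix coefficient is 12/sqrt(240), about 0.77.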



Table 2



Group 


No. 
residue 


Helix 
(%) 


Sheet 
(%) 


1 


2417 


20G 


2l-:> 




2465 


28-1 


10 o 


3 


2550 


27-5 


22-7 


4 


2450 


251 


20 7 


5 


2492 


260 


20- 1 


f> 


2476 


23-7 


20- i 


7 


2507 


27-2 


21 0 


8 


2504 


27-4 


10 4 



Proteins and their maximum \ Yer . , f 

homology with other proteins (%) maximum -o,, 



o (39 ' 5) • 2MT2 (3*8J, 1FXB(346), 2B5C<32-9) 3C<>C<43-7) 
2SNS(3]-2), 2LHB(29 5). 1MBD(»4). 1 ETU(240 -MCTV^-O ' 
lMCP-H(42-8), .SBTCW),. 2MDH-B(375, 3 CLU^ 
OON(44-8). II>rT(38 !»), 4PTI(34-5). IUBQ<382), 3WRP(31-7) 
15H«(30-0), 2PAB-A ( :;I G), lCY3(30-5), 4FXN<31 2) 2STW6 6) 
3RP2-A(31-2), ITON(37-4), 2CAB(24-2 , 4LDH(2 10) ''CTSH 9 i 
.IN^D(40.0), 2CI2-K36-9). ISN3 32-3 , >PCY< ] £ ^ 

JOAP-A(24o), 1FB4-H(4I 5), 4PTP(4I-9) •'CYPf'l-Rl 
2APR(39-1), 3GR8(lS-4) • 

PAZ(32-5), 1ECD(28 7), 2LZM(26-2), 3WGA-B(24-6) 
iSKh, 3 '' 2CGA - A < 40 °)- lABP(22-5), 2TBV-C(22-7), 

fJ™P"). «CTF{39-7), lRXT(29-8), 2CDV ( 29-9), 
o,» 3(27 ; 4) ' 11 P2 - L ( :l8 -3).4ATC-B(301), IGPI-A(25-5) 
4S>BV-G{27-0), 3EST(:i.r5), 5CPA(21-8), 4TLNCI-8) 3Pf!Kf>n.=;> 
3RXN ( 38- 5) , ,FDX,37<)). 3ICB(360), 2GnJ \c »V ^2, ' 
1LZ1(209), 2HHB-A,42-6), 2SOD-B(28-5) 1TC1- W-) 

4CPA-I(35I), 1CSE-I<349), IC05(33-7), 4FDl(311) '>SSI(3*-7) 
' v^'.t'' 1FX, < ;!7U )- 4WK-BI270), 3ADK(284 ; 3P M>6.'l, 
2GNA(24-9), IRHD(2*5), 2APP(39-3), 2CPP(I9 o) 
INS-A(47C), ITGS-K351), 2ABX-A(3fi-5). lHIP(32-9) 
ACX(36), 2CCY-A(M4-6), lo5C(3G-6), 4TNC(281), K!('r(->4-7> 
lCAT \!n\) 1TIM - A(24 - 2) ' 2l> RK(35-5,, 2MDH-A,37-7, ^ 



34-2 



31*5 



32-7 



31-5 



301 



31-8 



30-4 



33-4 



[...] (subunits) in each test group. (1FC2-C [...] subunit C of 1FC2.)
The number in parentheses is the maximum homology between that sequence
and all sequences in other groups (i.e. the training data set for that
test group). The last column is the average of the maximum homology of
each group.



(g) A measure of statistical significance

When comparing different prediction algorithms, we
need to know whether the differences in prediction accuracy
among them are statistically significant. Statistics
theory gives us a method to compute the "significance
interval" for the difference between 2 population proportions
(Daniel, 1987). In the case of secondary structure
prediction, the "proportion" is the percentage of residues
in a set of test data whose secondary structure state has
been correctly predicted. Assume the prediction accuracies
of 2 algorithms are $p_1$ and $p_2$ for 2 test data sets of $r_1$ and
$r_2$ residues, respectively, and the test data are randomly
selected, then we say that we are $\alpha \times 100\%$ confident that
the accuracies of the 2 algorithms are really different if

$$|p_1 - p_2| > I$$

where:

$$I = \Phi\!\left(\frac{1 + \alpha}{2}\right) \sqrt{\frac{p_1 (1 - p_1)}{r_1} + \frac{p_2 (1 - p_2)}{r_2}} \qquad (6)$$

$\Phi$ is the inverse cumulative normal distribution. For
example, when $\alpha = 0.95$, $\Phi((1 + \alpha)/2) = 1.96$; if
$r_1 = r_2 = 20{,}000$, the significance interval is $I \approx 0.9\%$;
if $r_1 = r_2 = 4000$, $I \approx 2.1\%$. If we choose $\alpha = 0.99$,
$\Phi((1 + \alpha)/2) = 2.58$, then $I \approx 1.2\%$. Thus, the bigger
the difference between 2 prediction accuracies, the more significant
it is. For the same difference, the more test data used, the more
significant it is (and the more confident we are). Equation (6) is
used in this paper to determine whether the difference in
the accuracies of 2 different predictions is statistically
significant.
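
The significance test can be sketched with the Python standard library as
follows. This assumes the usual unpooled standard error for the difference
between two proportions; the function names are hypothetical, and the
accuracy of 0.64 used in the example is an assumption, chosen because it
reproduces the interval widths quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def significance_interval(p1, r1, p2, r2, alpha=0.95):
    """Width I of the interval for the difference of two accuracies."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # inverse cumulative normal
    return z * sqrt(p1 * (1 - p1) / r1 + p2 * (1 - p2) / r2)


def significantly_different(p1, r1, p2, r2, alpha=0.95):
    """True if we are alpha x 100% confident the accuracies differ."""
    return abs(p1 - p2) > significance_interval(p1, r1, p2, r2, alpha)


# Intervals in percent for two algorithms of about 64% accuracy (assumed).
i_20000 = 100 * significance_interval(0.64, 20000, 0.64, 20000, alpha=0.95)
i_4000 = 100 * significance_interval(0.64, 4000, 0.64, 4000, alpha=0.95)
i_20000_99 = 100 * significance_interval(0.64, 20000, 0.64, 20000, alpha=0.99)
```

Under these assumptions a difference of about 1 percentage point between
two methods tested on roughly 20,000 residues each is significant at the
0.95 level, whereas on 4000 residues a difference of more than 2 points
would be needed.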



3. Experiments and Results

(a) K-way cross-validation

To evaluate the hybrid system, all the proteins in
our database were randomly divided into eight
groups. In each test, one group of proteins was used
as the test data set and the rest as the training data
set. The whole experiment consisted of eight such
tests, i.e. eight independent runs of the hybrid
system, each time on a different test data set. This
way, there was no overlap between training data
and test data, and every protein was used as test
data once. This is the so-called "k-way cross-validation"
testing procedure. Table 2 lists the proteins
and the number of residues in each group, the α
helix and β sheet contents in the group, as well as
the degree of homology between proteins in
different groups.
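
The k-way cross-validation procedure can be sketched as follows. The
function names, the fixed random seed and the use of integer identifiers
in place of proteins are assumptions for illustration; the sketch only
randomizes group membership and does not try to reproduce the particular
grouping of Table 2.

```python
import random


def k_way_groups(items, k=8, seed=0):
    """Randomly divide the items into k groups of nearly equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_splits(items, k=8, seed=0):
    """Yield (training set, test set); each group is the test set once."""
    groups = k_way_groups(items, k, seed)
    for i, test in enumerate(groups):
        train = [x for j, g in enumerate(groups) if j != i for x in g]
        yield train, test


protein_ids = list(range(107))  # hypothetical stand-ins for the 107 proteins
splits = list(cross_validation_splits(protein_ids))
```

Each of the eight runs trains on seven groups and tests on the remaining
one, so no protein is ever in both sets and every protein is tested
exactly once.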



(b) Window size and other choices

Throughout this work, a window size of 13
residues was used. Each expert looked at 13 residues
at a time and predicted the secondary structure
state of the center residue in the window. The
Combiner looked at the predictions of 13 residues
from each expert and made a final prediction for the
center residue. For each amino acid sequence in the
test data set, the window was moved over the whole
sequence, and a prediction was made for every
residue.

There were other choices that had to be made
before starting our k-way cross-validation experiment
with the hybrid system. They included (1) the
number of hidden units and the number of training
cycles for neural networks; (2) the threshold for
"nearest neighbors" in MBR module; and (3) the
coefficient $C_f$ in the SM. If these choices were made
according to the system's performance on the test
data set, then they might be fine-tuned to fit the
particular data set and make the system's accuracy
appear higher than it really is. To avoid this, prior
to the k-way validation experiment, a "pilot set" of
20 proteins was randomly chosen from the database,
and the above choices were made based on the
system's performance on this pilot set. (The pilot set
consisted of: 1INS-A, 3RXN, 2MT2, 1CTF, 351C,
201)/, 1HMQ-A, 1RN3, 1PP2-L, 4FXN, 2SOD-B,
1MBD, 1GP1-A, 1FB4-L, 4PTP, 1TON, 2PRK,
4ATC-A, 4LDH, 1PHH.)

(c) MBR and SM: training and prediction

In the Memory-Based Reasoning module, first the
distance matrices were computed using the training
data set. There was one distance matrix for each
position of the window; see equation (1) for details.
Then for each segment (window) of the amino acid
sequences in the test data set, b_1, b_2, ..., b_n, the top
25 instances in the training data set that had the
shortest distance to it were considered its neighbors.
The strength of prediction (score) for each
secondary structure state was the percentage of
neighbors in that state weighted by the inverse of
their distances. The structure state that had the
highest score was taken as the prediction by MBR.
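
The neighbor vote can be sketched as below. The distance used here is a plain mismatch count, which is only a stand-in: it is not the per-position distance matrix of equation (1). The function names and the small `eps` that guards against division by zero are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> float:
    """Stand-in distance: number of mismatched positions (NOT equation (1))."""
    return float(sum(x != y for x, y in zip(a, b)))


def mbr_scores(query: str, memory: List[Tuple[str, str]],
               k: int = 25, eps: float = 1e-6) -> Dict[str, float]:
    """Score each state by the inverse-distance-weighted share of the k nearest neighbors."""
    nearest = sorted(memory, key=lambda item: hamming(query, item[0]))[:k]
    weights: Dict[str, float] = defaultdict(float)
    for window, state in nearest:
        weights[state] += 1.0 / (hamming(query, window) + eps)
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


def mbr_predict(query: str, memory: List[Tuple[str, str]], k: int = 25) -> str:
    """Return the state with the highest score."""
    scores = mbr_scores(query, memory, k)
    return max(scores, key=scores.get)


# Toy memory of (window, state of center residue) pairs; windows shortened to 4 residues.
memory = [("AAAA", "H"), ("AAAC", "H"), ("CCCC", "E"), ("AACC", "C")]
print(mbr_predict("AAAG", memory, k=3))
```

Weighting by inverse distance lets a few very close neighbors outvote a larger number of distant ones.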

In the statistical module, the frequencies of
singletons and pairs of amino acids within a window
a_1 ... a_n were calculated for each structure state s_j
in the training data set, to approximate the conditional
probabilities p(a_i | s_j) and p(a_i, a_k | s_j). Then for
each segment of amino acid sequences in the test
data set, b_1, b_2, ..., b_n, these probability values were
used to estimate the probability p(s_j | b_1, b_2, ..., b_n)
according to equation (5) (C_f = 1.5 was used in this
work), where s_j is one of the secondary states
(α-helix, β-strand and coil). The value of
p(s_j | b_1, b_2, ..., b_n) was taken as the score of prediction
for structural state s_j, and the state that had
the highest score was taken as the prediction by SM.
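
The counting step for singletons can be sketched as follows. This covers only the frequency estimate of p(a_i | s_j); it does not implement equation (5) or the pair statistics, and the function name and data layout are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def singleton_frequencies(
    examples: List[Tuple[str, str]],
) -> Dict[Tuple[int, str, str], float]:
    """Estimate p(a_i | s_j) from (window, state) training pairs.

    The key (i, a, s) maps to the fraction of windows in state s
    that carry residue a at window position i.
    """
    counts: Counter = Counter()
    per_state: Counter = Counter()
    for window, state in examples:
        per_state[state] += 1
        for i, residue in enumerate(window):
            counts[(i, residue, state)] += 1
    return {key: n / per_state[key[2]] for key, n in counts.items()}


# Toy training data with 2-residue windows.
freq = singleton_frequencies([("AC", "H"), ("AG", "H"), ("CC", "E")])
print(freq[(0, "A", "H")], freq[(1, "C", "H")])
```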

(d) Training neural networks 

One important issue in training neural networks
by the Back-propagation algorithm is deciding
when to stop training. If a network is trained
through too many cycles, the network tends to
memorize the training examples but generalizes
poorly on the inputs that it has not been trained on
(i.e. test data). One practice is to monitor the
performance of the network being trained on the
test data, and to stop training when the performance
peaks. This strategy cannot be used in real
situations where the true answer is truly unknown.
We used the following techniques to solve this
problem: (1) limiting the number of training cycles;
(2) limiting the number of hidden units, thus the
number of free variables (the "memory capacity")
in the network; (3) when available, using a separate
control data set to control when to stop training the
network, that is, to monitor the performance of the
network being trained on the control data set and
stop training when the performance peaks.
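
Technique (3) can be sketched as a loop that keeps the weights with the best control-set accuracy within a fixed budget of cycles. The callbacks `train_one_epoch` and `control_accuracy` are hypothetical placeholders for whatever network code is in use; nothing here is specific to the authors' implementation.

```python
from typing import Callable, Optional, Tuple, TypeVar

W = TypeVar("W")  # whatever object represents a snapshot of the network weights


def train_with_control(
    train_one_epoch: Callable[[], W],
    control_accuracy: Callable[[W], float],
    max_epochs: int = 200,
) -> Tuple[Optional[W], float, int]:
    """Train for at most max_epochs and keep the weights that score best on the control set."""
    best_weights: Optional[W] = None
    best_acc, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch()        # one pass over the training data
        acc = control_accuracy(weights)    # the test data is never consulted
        if acc > best_acc:
            best_weights, best_acc, best_epoch = weights, acc, epoch
    return best_weights, best_acc, best_epoch


# Simulated control-set accuracy curve that peaks at the third cycle.
curve = [0.50, 0.60, 0.65, 0.63, 0.60]
epochs = iter(range(len(curve)))
print(train_with_control(lambda: next(epochs), lambda i: curve[i], max_epochs=5))
```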

A one-hidden-layer neural network was used as
one of the three experts. This network is referred
to as EXPERT-NN in the following discussion.
A total of 21 input units was used to encode one
residue, one unit for each of the 20 amino acid types
plus one end marker. With a window size of 13
residues, there were 21 x 13 = 273 input units total.
EXPERT-NN had three output units, one for each
of the three secondary structure states (α-helix,
β-sheet and coil). The network had only two hidden
units. EXPERT-NN was trained up to 200 epoch
cycles on the training data set, and the network
weights that gave the best performance on the
training set during training were saved as the final
result of training. The activations of the output units
were used as the scores of prediction for the
corresponding secondary structures.
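
The input encoding can be illustrated with a short sketch. The ordering of the amino acid alphabet and the `-` end-marker symbol are assumptions; only the counts (21 units per residue, 273 per window) come from the text.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes
END_MARKER = "-"                        # hypothetical symbol for the end marker
ALPHABET = AMINO_ACIDS + END_MARKER     # 21 symbols, one input unit each


def encode_window(window: str) -> List[int]:
    """Encode a window as concatenated one-of-21 unit vectors."""
    vector: List[int] = []
    for residue in window:
        units = [0] * len(ALPHABET)
        units[ALPHABET.index(residue)] = 1
        vector.extend(units)
    return vector


encoded = encode_window("------ACDEFGH")   # a 13-residue window at the start of a sequence
print(len(encoded))                         # 21 x 13 = 273 input units
```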

The Combiner of our hybrid system was also a
one-hidden-layer neural network. The Combiner
took the outputs of the three experts as inputs and
made final predictions based on these outputs. For
every residue, each expert generated three numbers
representing the prediction score for α-helix, β-sheet
and coil, respectively. The Combiner took the
predictions of 13 residues from each of the three
experts as its input, thus it had 13 x 3 x 3 = 117
input units. It also had three output units, one for
each of the secondary structure states. As discussed
in Methods and Materials, in order to train the
Combiner, the training data set was divided into
two halves, which will be referred to as {H_1} and
{H_2} in the following discussion. The three experts
were first trained on the first half of the data set
{H_1}. Then they were applied on the second half
{H_2}. Their outputs on {H_2}, {Output(H_2)}, were
then used as input patterns for training the
Combiner. Similarly, the three experts were also
trained on {H_2} and their outputs on {H_1},
{Output(H_1)}, were recorded. Finally, the Combiner
was trained up to 200 epoch cycles, using
{Output(H_2)} as training data and {Output(H_1)}
as control data. The weights that gave the
best performance on both {Output(H_1)} and
{Output(H_2)} during training were saved as the
result of training the Combiner. A total of 30 hidden
units was used in the Combiner. Since there was a
control data set here, the number of hidden units
was less crucial here than in EXPERT-NN.
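
The two-halves procedure can be sketched as follows. The `Trainer` and `Scorer` interfaces are hypothetical; they stand for any expert that can be trained on a list of examples and then return three scores per window. The sketch produces the nine scores per residue; stacking 13 neighboring residues gives the 117 Combiner inputs.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]                     # (window, true state of center residue)
Scorer = Callable[[str], List[float]]         # window -> [helix, sheet, coil] scores
Trainer = Callable[[List[Example]], Scorer]   # training half -> trained expert

WINDOW, EXPERTS, STATES = 13, 3, 3
COMBINER_INPUTS = WINDOW * EXPERTS * STATES   # 13 x 3 x 3 = 117


def expert_outputs(train_half: List[Example], apply_half: List[Example],
                   trainers: Sequence[Trainer]) -> List[Tuple[List[float], str]]:
    """Train every expert on one half and record its scores on the other half."""
    scorers = [train(train_half) for train in trainers]
    return [([s for scorer in scorers for s in scorer(window)], state)
            for window, state in apply_half]


def combiner_sets(h1: List[Example], h2: List[Example], trainers: Sequence[Trainer]):
    """Return (training data, control data) for the Combiner."""
    training = expert_outputs(h1, h2, trainers)   # {Output(H_2)}
    control = expert_outputs(h2, h1, trainers)    # {Output(H_1)}
    return training, control
```

Because each expert only ever scores windows it was not trained on, an expert that memorizes its training half gains no unfair influence over the Combiner.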

(e) The hybrid system improved prediction accuracy 

Table 3 shows the results for the eight test data 
sets in our k-way cross-validation experiment. Table

Table 3

Prediction accuracy on test data sets

Group   No. sequence   No. residue   EXPERT-NN (%)   SM (%)   MBR (%)   Hybrid (%)
1       14             2417          60.7            62.8     64.4      65.3
2       15             2465                          63.3     63.9      66.3
3       14             2550          62.2            63.6     64.7      66.2
4       14             2450          62.1            62.9     64.0      66.2
5       14             2492          63.?            62.4     64.4      66.6
6       14             2476          65.2            64.1     65.8      68.1
7       14             2507          62.3            63.8     63.1      65.1
8       14             2504          64.9            65.5     65.5      67.5

Total   113            19,861        63.1            63.5     64.5      66.4

The prediction accuracy on each test data set by the 3 experts and by the hybrid system. No.
sequence, number of sequences (subunits) in each group; No. residue, number of residues in
each group.



4 shows the accuracy for each sequence. Overall, for
the prediction of secondary structures α-helix,
β-sheet and coil, EXPERT-NN was 63.1%
accurate, MBR was 64.5% and SM was 63.5%. The
hybrid system was 66.4% accurate. The total
number of residues used in the experiment was
19,861. According to the statistical significance
measures described in equation (6), the improvement
of the hybrid system over each expert was
statistically significant (with higher than 0.99 confidence
level). Thus we are highly confident that our
hybrid system really improves the prediction
accuracy.
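
As a rough plausibility check of such significance statements, one can use a standard two-proportion z-test. This is a stand-in chosen for illustration and is not claimed to be the measure of equation (6); it also treats residues as independent trials, which is an assumption.

```python
import math


def z_score(acc_a: float, acc_b: float, n: int) -> float:
    """Two-proportion z statistic for two accuracies, each measured on n residues."""
    pooled = (acc_a + acc_b) / 2.0
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    return (acc_a - acc_b) / std_err


def confidence(z: float) -> float:
    """One-sided confidence level: the standard normal CDF evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hybrid (66.4%) against the best single expert, MBR (64.5%), on 19,861 residues.
z = z_score(0.664, 0.645, 19861)
print(round(z, 2), confidence(z) > 0.99)
```

Under these assumptions a difference of 1.9 percentage points on 19,861 residues clears the 0.99 level, whereas a difference of 0.5 points on the same number of residues does not reach 0.95.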

The Matthews' correlation coefficients for each
expert and for the hybrid system are shown in Table
5. All three experts had similar coefficients and
produced better prediction on α helix and coil than
on β strand. One reason for this might be that a
single β strand can hardly be stable; more than one
strand get stabilized when they interact with one
another to form a sheet; this interaction is often
not local along the sequence and thus cannot be
captured very well by the local approach. Thus, no
matter what algorithm is used, β strand would still
be the most difficult state to predict. The hybrid
system improved the prediction for all the structure
states.
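
For reference, the Matthews' correlation coefficient for one state can be computed from per-residue counts as sketched below. The function name and the convention of returning 0 when the denominator vanishes are assumptions.

```python
import math
from typing import Sequence


def matthews(real: Sequence[str], predicted: Sequence[str], state: str) -> float:
    """Matthews' correlation coefficient for a single secondary structure state."""
    pairs = list(zip(real, predicted))
    tp = sum(r == state and p == state for r, p in pairs)   # correctly predicted
    tn = sum(r != state and p != state for r, p in pairs)   # correctly rejected
    fp = sum(r != state and p == state for r, p in pairs)   # overpredicted
    fn = sum(r == state and p != state for r, p in pairs)   # underpredicted
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(matthews("HHHHCCCC", "HHHCCCCC", "H"))
```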

(f) A single small test data set is dangerous 



From Table 3 we computed the average difference
in prediction accuracy among the three different
experts for the same sets of test data, which was
0.9%. This shows that the overall accuracies of the
three experts were very close. We also computed the
average difference for the same expert on the eight
different test data sets, which was 1.3%. Thus, if
each test data set is observed independently, the
difference in prediction accuracy caused by the
different test data sets was at least as large as the
difference brought about by the different experts.
This observation argues strongly against using a
single small test data set: (1) "statistical noise" can



Table 4 

The accuracy on each protein sequence (subunit) by
the three experts and the hybrid system



Protein 



1 550 

IABP 
IACX , 
IBP2 
1CC5 
ICCK 
1CPV 
I CRN 
1GSE-I 
ICTF 
1CY3 
1ECD 
IETU 
4IFH4-H 
IFJU-L 
IFC1-A 
1FC2-C 
1FDX 
1FXI 
1FXB 
IGCN 
I OCR 
IGD1-0 
1GP1-A 
I HI P 
1HMQ-A 
1TNS-A 
liNS-I) 
1LH1 
1LZ1 
1MB!) 
1MCP-H 
1MCP-L 
I M.LT-A 
1NXB 
I PAZ 
1PCY 
1PHH 
1PP2-L 
1PPT 
1RHD 
1RN3 
1RNT 



SM 
(%) 



64- 9 

62- 7 
59-8 
701 
520 

75- 9 
670 

63- 9 
50-0 

65- 1 

55- 9 

68- 6 

44- 9 
08-9 

71- 6 

66- 2 

56- 8 
74-4 

72- 2 
52-4 
82-7 
55-2 

45- 4 

57- 1 

64- 7 
04-7 

69- 9 
47-6 

76- 7 

70- 6 
73*8 

67- 3 

64- 4 
650 
42-3 
67-7 

59- 2 

58- 6 

60- 4 

65- 4 

77- 8 
64-2 
548 
64-4 



MBR 

(%) 



74- 6 
61-8 
53-3 
63-6 

55- 3 
72-3 
70-3 
55 6 

56- 5 
68-3 

57- 4 
70-3 
43-4 
714 
75*5 
70-8 
60-2 
76-7 
79*6 

■ 57*8 

75- 3 
41-4 
52*9 
60-1 

60- 3 
67-1 
56-6 
38*1 
73*3 
61*4 

66- 9 

67- 3 
75-7 
70*0 
50-0 

61- 3 
65-8 
64*6 
581 
70-7 
83-3 
66*2 

62- 9 
61 5 



EXPERT-XX 



64*9 
70*0 

57- 8 
60-7 
520 
699 
70*3 
60*2 
500 

63- 5 
50*0 

75- 4 
36*8 
70-9 
65*9 
611 
60*2 
OO-o 
70-4 
57* 1 
70-4 
48-3 
56*9 

64- 3 
65*8 
60-0 

58- 4 
47*6 

76- 7 
68-0 
70-1) 

60- 1 

61- 3 
61 -4 
500 

66 r 

63*3 
65*7 
66*0 
66*9 
750 
64-8 
58 1 
68-3 



H\hrid 



: )-9 

( t-5 

r 7-5 

< 1-7 
; 2*8 
171 
7 3-0 
t 6*7 
.- ;>..> 

: i -4 

. ; 4*4 

: 4 6 

- 5-6 
■ 7*0 

1-2 
« 81 
. -8-3 

91 



'7-8 
• .8-3 
vi2-9 
i3 1 
15-2 
iO-O 
>3*7 
12-9 

i;i*3 

/2-5 
iS-5 
58-0 

)91 
162 
>2*9 
36*7 
;>7*7 
64-2 
692 
88*9 

66- 2 
66*9 

67- 3

Table 4 (continued)





SM 


MBR 


EXl'ERT-NN 


Hybrid 


Prote i\ \ 


■Co) 


(%) 




(%) 


1SBT 


03-3 


08-4 


GO* 5 


07*3 


1SOT 


00" 


75-4 


05*0 


783 


isx:t 


73-8 


70-9 


70*8 


72-3 


rrt;s i 


(»:t 2 


57-9 


00*7 


50- 1 


itim a 


70-0 


(Hi- 1 


09*0 


73-8 


1TON 


70- "J 


78*2 


09*7 


750 




01*8 


55-3 


05*8 


03-2 




89' 2 


78*4 


811 


81*1 


■> 


08'3 


7 20 


07 0 


73*4 


o V PT 


01-9 


57*9 




59-4 




09-2 


08' 9 


00- 8 


08- 3 


2AZA B 


45*7 


51*2 


48- 1 


48- 1 


_ 1>.H 


or>9 


07- 1 


63-5 


553 


•M ' \ l< 
_l A I ; 


04' I 


09*5 


7 11 


09*5 


•'(.'( 'V A 


75' 0 


63*8 


79-5 


83-5 




7 "2*9 


7 3- 8 


71-0 


7 00 


-l. WA ■ A 


04'9 


747 


584 


72-2 


_v I _ 


OO'O 


708 


04- 0 


70*8 




57-8 


5 80 


00-3 


59- 1 


■WPP 

-V I 1 


09 0 


01*5 


01*5 


05-9 


1 > 


no \ t 


U— ») 


*U4 I 


G91 




03*8 


03-5 


00*1 


04*8 




09*0 


05' 5 


02* 1 


70* I 






"7(l-*> 


00*0 


78*0 


-ran., i-n 


O I U 


-*iO-fi 




01*0 


•)1 H I 


Uo ■> 




00' 4 


67*1 


or v\ 


in * 4 


ft i .ft 

U4 U 


010 


0 10 


-Al Ui 1 - A 


- = . .) 

•>•> - 


-.T.I 
*> / 4 


508 


01*7 


■> \ 1 1 \ ' I 1 1 
JA1 m l-l> 


4n t i 




4 1 1 


52*0 


-At 1 - 


im. 1 
;M 1 


.11 0 


Q- 1 

yo 1 


90- 7 


lIO\ C 


1)- •> 


U4 0 


fin- 7 


00*1 


,){) \ 1 \ 

.1 Al -A 


.)()■ 0 




**iO-(i 


58* 8 


2PR1 


03-8 


IU1Q ' 


uo 0 


1 1 0 




ii \ .i\ 
() 1 u 


OU 0 


l\ \ -7 


uo 0 




f 'if i* 1 




72*2 


715 




J 4 o 


09-2 


72*0 


78*5 


ovrv 


-1.1 
Oil 


00 0 


^1 ■ l 


53*8 


.(Tin f 


Ml--* 


020 


00* 1 


04*5 


.Mil 


o 1 * 


70*8 


793 


80- 0 


.tA 1 "1 . 


U-Jr -f 


fW(l 

00 u 


• 01 -9 


09- 6 


*ilU*l 
olH 1 


J.0-7 


4U < 


50 0 


44*7 




J I 4 


S9-1 

"5^ I 


on 0 


08- 7 






/ t ' t 


fi4*9 


79-3 


't l? V f . 


* - 4 


1 .'t*.') 




770 


'It 1 \ 1 A 


OU O 


fifl-1 
OU I 


"U-8 

f>4 O 


50*7 


.>l»i\i 


.>0 o 


00* / 


02*7 


03*3 


.i 1 1.' t> 


*>.) O 


0.1 0 




90*7 


•*Pf *I 
•>l lit 


(Will 


06*0 




67*7 


.1 Kith 




07-8 


07*4 


08*3 


o I\ I - A 


o4 - 9 


00' 7 


55*3 


02*0 




82*7 


84*0 


84*0 


84* 0 


3\Y(i \-B 


SO- 1 


779 


80*7 


80-7 


3YVR J * 


7 - v o 


70-3 


00*3 


73*3 


4 A 1)1 1 


o72 


54-3 


59-4 


57*5 


4AT( - A 


58' 4 


01*3 


01-0 


03-5 


4AT( -H 


02-1 


01-4 


64- 1 


02-7 


4C1V -I 


78*4 


070 


730 


70*3 


4 OKI -B 


59- 1 


58*5 


59*7 


02*9 


4FD) 


07-9 


730 


75*5 


720 


4 fx:; 


081 


03*8 


64-5 


68*8 


41,1)1 ( 


59-8 


58*3 


592 


61-3 


4PT1 


70-7 


00*3 


70-7 


02- 1 


4 PIT 


71-4 


82-5 


08-8 


77*8 


4SB\ -C 


53-2 


54* 1 


57*2 


55-9 


4TU: 


57*9 


02*3 


582 


05*8 


4TX<: 


83-7 


■ 781 


77-5 


794 


51 iV 


00-9 


03-8 


03-8 


00-4 


7CA1 - A 


05-7 


04-5 


05*5 


04-9 


mwr 


70-3 


80-7 


70-3 


76*4 


Tut a 


035 


04*5 


631 


60*4


Table 5

Correlation coefficients for each expert and the hybrid system on each structural state

Method

SM          0.390   0.418   0.350
MBR         0.390   0.410   0.357
EXPERT-NN   0.395   0.383   0.333
Hybrid      0.429   0.470   0.387


Table 6

The percentage of total residues for which two experts
produced the same secondary structure prediction


EX PERT- NN 


MBR SM 


Hybrid 


EXPKRT-NN 

MBR 

SM 


70*0% 84-3% 
77-7% 


82-9% 
S20% 
830% 


Table 7 

Percentage accuracy 


One correct Two correct Three correct 


Three incorrect 


70*0% 


0.4-0% 50-0% 


10-4% 



make the same algorithm have different accuracies 
on different test data sets if the sets are small; 
(2) the difference among different algorithms, even 
if it truly exists, can be easily "buried" by such 
statistical noise. Thus, large or multiple test data 
sets should be used whenever possible. 
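
The size of this statistical noise can be estimated with the binomial standard error of an accuracy measured on n residues. This is only an approximation under the assumption that residues are independent trials; neighboring residues are correlated, so the true noise is probably larger.

```python
import math


def accuracy_standard_error(accuracy: float, n_residues: int) -> float:
    """Binomial standard error of an accuracy estimated from n_residues predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_residues)


# One test group of about 2500 residues versus the whole database of 19,861 residues.
print(accuracy_standard_error(0.645, 2500))
print(accuracy_standard_error(0.645, 19861))
```

On a single group of about 2500 residues the standard error is close to one percentage point, which is the same order as the differences among the experts.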

(g) Different algorithms made similar predictions 

The three experts used in our experiments did not
only have similar overall prediction accuracies, but
also made similar predictions for each sequence.
Table 6 shows the percentage of the residues in the
test data sets for which different experts produced
the same predictions. On average, each pair of
experts agreed with each other on about 80% of the
total 19,861 residues. All three experts produced the
same prediction on about 70% of the total residues
(not shown in the Table). Table 7 shows the percentage
of residues for which at least one expert was
correct, at least two experts were correct, all three
experts were correct and all three experts gave the
same but wrong predictions. For about 20% of the
residues, all three experts produced the same but
wrong predictions. This, together with the information
from Table 6, indicates that the "local rules"
(the rules mapping short segments of amino acid
sequences to secondary structures) obtained by the
three very different experts were actually quite
similar, but they did not apply quite as well to the
test data. This may suggest an upper bound on the
secondary structure prediction accuracy based on
local information from the currently available data.
In places where all algorithms were the same but

Table 8

Accuracy of predictions

                       α Helix                         β Sheet
Method       Correct  Over  Under  Coef.     Correct  Over  Under  Coef.
SM           345      195   171    ?         440      200   375    0.283
MBR          300      182   207    0.341     404      281   420    0.244
EXPERT-NN    314      181   202    0.351     421      31?   403    0.238
Hybrid       353      162   103    0.445     450      234   374    0.335

(see Methods and Materials).



incorrect, the structures might be determined by 
non-local interactions. Among the residues where all 
three experts did agree with one another, they were 
correct for 71% of the residues. Thus, if we only 
consider the cases where all three experts agreed, we 
have a much higher prediction accuracy. 
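
The agreement statistics discussed in this section can be computed as sketched below. The function names and the toy state strings are assumptions for illustration.

```python
from typing import Sequence, Tuple


def pairwise_agreement(a: str, b: str) -> float:
    """Fraction of residues for which two experts give the same prediction."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def unanimous_accuracy(real: str, predictions: Sequence[str]) -> Tuple[float, float]:
    """Return (fraction of residues where all experts agree, accuracy on those residues)."""
    unanimous = [i for i in range(len(real))
                 if len({p[i] for p in predictions}) == 1]
    correct = sum(predictions[0][i] == real[i] for i in unanimous)
    accuracy = correct / len(unanimous) if unanimous else 0.0
    return len(unanimous) / len(real), accuracy


real = "HHCEC"
experts = ["HHECC", "HHEEC", "HCECC"]
print(pairwise_agreement(experts[0], experts[1]))
print(unanimous_accuracy(real, experts))
```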

(h) Homology between training and test data set

It is known that if the training data and the test
data are identical or highly homologous, then the
prediction accuracy could be misleadingly high.
However, when the degree of homology between
training and test data was below 50%, we did not
find strong positive correlations between the prediction
accuracy and the degree of homology. For
example, the degree of homology between 1GCN,
1MLT-A and 1INS-A and their training data were
44.8%, 46.2% and 47.6% respectively, and their
prediction accuracies were quite low (see Table 4),
whereas 2CTS, 3CRS and 7CAT-A had very low
homology with their training data (19.2%, IX*%
and 17.1%, respectively), but their prediction
accuracies were much higher.



(i) Secondary structures as individual units

Often it is more important to predict correctly the
occurrence or absence of a secondary structure (α
helix or β strand) as a whole rather than just to
predict the states of individual residues. Thus the
following criteria were also used in this work to
evaluate the predictions of different methods: we
took an α helix or β strand as an individual unit,
and checked how many of these secondary structures
were correctly predicted (positive cases), how
many of them were not predicted at all (underpredicted),
and how many were predicted which do not
exist in the real structures (overpredicted). Then a
Matthews' correlation coefficient is calculated for
each method. We found that the hybrid system had
the most positive cases and the fewest overpredictions
and underpredictions. (Note that this is in
terms of number of secondary structures, not
residues.)

Specifically, in this work an α helix is said to have
been predicted if at least four continuous residues in
a sequence are predicted to be in H state; a β strand
is said to have been predicted if at least two continuous
residues were predicted to be in E state. If
the overlapping region between a real secondary
structure and a predicted secondary structure of
the same type is greater than half of the length
of the real structure or the predicted structure,
then the real secondary structure is considered to
have been correctly predicted. If more than one
predicted secondary structure overlaps with one
real secondary structure, only one of the predicted
secondary structures is considered as a correct
prediction, and the rest are counted as overpredictions.
If one predicted secondary structure overlaps
with more than one real secondary structure, only
one of the real secondary structures is considered as
correctly predicted, and the rest are counted as
underpredictions. Table 8 lists the correct predictions,
overpredictions, underpredictions and
Matthews' coefficient for α helix and β strand by
each expert and the hybrid system according to
these criteria. (In calculating Matthews' coefficients,
the residues between 2 helices (sheets) are
considered to form 1 non-helix (non-sheet).) The
hybrid system produced the best result by these
criteria as well.

No doubt the above criteria are not perfect, and
the details such as the numbers 2 for β strand and 4
for α helix are to some extent arbitrary. However,
we need some criteria to capture the intuitive notion
of "how many secondary structures are predicted
correctly". We believe the above criteria serve as
an unbiased, first-order approximation to that. They
provide a new perspective to evaluate different
prediction methods. For example, SM is better than
MBR and EXPERT-NN by these criteria, whereas
that is not the case if we count the number of
correctly predicted residue states (see Table 3).
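
The unit-level criteria of this section can be sketched as follows. Several details are assumptions: segments are half-open index ranges, the one-to-one pairing of real and predicted segments is done greedily from left to right, and the function names are hypothetical. The minimum lengths (4 for helix, 2 for strand) and the overlap rule follow the text.

```python
from typing import List, Tuple

Segment = Tuple[int, int]   # half-open interval [start, end)


def segments(states: str, state: str, min_len: int = 1) -> List[Segment]:
    """Find maximal runs of `state` that are at least min_len residues long."""
    found, start = [], None
    for i, s in enumerate(states + "\0"):      # the sentinel closes a run at the end
        if s == state and start is None:
            start = i
        elif s != state and start is not None:
            if i - start >= min_len:
                found.append((start, i))
            start = None
    return found


def matches(real: Segment, pred: Segment) -> bool:
    """True if the overlap exceeds half the length of the real or the predicted unit."""
    overlap = min(real[1], pred[1]) - max(real[0], pred[0])
    return overlap > 0 and (2 * overlap > real[1] - real[0]
                            or 2 * overlap > pred[1] - pred[0])


def count_units(real_states: str, pred_states: str, state: str,
                min_len: int) -> Tuple[int, int, int]:
    """Return (correct, overpredicted, underpredicted) counts of whole units."""
    real = segments(real_states, state)
    pred = segments(pred_states, state, min_len)
    used, correct = set(), 0
    for r in real:
        for j, p in enumerate(pred):
            if j not in used and matches(r, p):   # each predicted unit counts once
                used.add(j)
                correct += 1
                break
    return correct, len(pred) - correct, len(real) - correct


real = "CCHHHHHHCCEEEECCCCCC"
pred = "CCHHHHCCCCCCEECCHHHH"
print(count_units(real, pred, "H", min_len=4))
print(count_units(real, pred, "E", min_len=2))
```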



(j) An example 

Figure 4 shows the prediction for protein 1PAZ
by each expert and the hybrid system. It illustrates
the points discussed in previous sections. Notice that
the inputs from each expert to the Combiner in our
hybrid system are the three prediction scores for
each of the three states (α helix, β sheet and coil),
not just the predicted states themselves; and the
Combiner looks at the prediction scores of 13 positions
at a time. That is why we can see that
in certain cases the Combiner can override the
majority of the three experts, such as between
residue 0 and 10 of 1PAZ. In some places, all three
experts made the same wrong predictions. For
example, there is a short β strand between residue
40 and 50 that none of the experts predicted; and
they all predicted a helix between residue 50 and 60
that does not exist in the real structure. In both
cases, the Combiner made the same mistake also.
None of the experts could always make better
predictions than others. For example, SM is the only
one that predicted the sheet between residue 20 and
30. MBR is the only one that did not give the false
α helix between residue ^ oo.t

(k) Comparison with other methods

Qian & Sejnowski (1988) used a cascaded neural
network system in secondary structure prediction
and achieved 64.3% accuracy on a test set of 15
proteins (containing 3520 residues). Their system
contained two networks: the first network took
amino acid sequences as inputs and produced the
initial prediction; the second network "cleaned up"
this initial prediction to produce final predictions.
This system could also be seen as a hybrid system
but with only one expert. We applied their method
to our eight test data sets. Table 9 shows the results.
This was done not only to compare the final results
but also to see whether adding two more experts
could really help. The overall prediction accuracy of
the cascaded system on our test data sets was
64.0%, which, on a much larger scale (19,861 versus
3520 residues), confirmed Qian & Sejnowski's
results. However, the improvement of the cascaded
network over a single network was only 0.5%, not
1.5% as reported in their paper. According to our
statistical significance measure (equation (6)), both
0.5% for 19,861 residues and 1.5% for 3520 residues
were not statistically significant differences at confidence
level 0.95. We also noticed that there was



Table 9 

The accuracy on the eight test data sets by cascaded
networks of Qian & Sejnowski (1988)



Protein Secondary Structure Prediction



Orouj) 


Xo. 
sequence 


' No. 
residue 


Single network Cascaded network 
<%> <%) 


1 

•> 

3 
4 

5 


14 
!f> 
14 
14 
14 


2417 

— •tVIt 

2550 
2450 
2492 


01- 9 6 2-5 
04-3 04.3 

02- 5 (i3 . 2 
f >- 7 . 02-9 

04-3 


(i 
7 
8 


14 
14 
14 


2470 
2507 
2504 


05-5 fi6 .fj 
020 (J2-9 
U5-3 05.5 


Total 


1 1 :s 


19.801 


03*5 (i4- 0 



some difference in prediction accuracy (0.4%)
between their single network and our
EXPERT-NN, even though they were both trained
and tested on the same data sets. The reason was
that according to Qian & Sejnowski's method, the
performance of their network on the test data set
was monitored during training. The network
weights that performed the best on the test set were
saved, whereas in our work, the
EXPERT-NN never saw the test data set during
training (see Methods and Materials).

The GOR III algorithm by Gibrat et al. (1987)
was reported to have achieved 63% prediction
accuracy by using correlations between certain pairs
of amino acids and secondary structures. Biou et al.
(1988) further improved the GOR III algorithm by
combining its result with that of two other algorithms,
the Homologue method and the bit pattern
method, achieving a reported accuracy of 65.5%
(this combined algorithm is referred to as
GOR-Combined in the following discussion). We ran
the GOR-Combined program on protein sequences
in our database. Since their program contained the
statistics calculated using their database, i.e. their
"training data", we divided our database into two
groups. Group A contained sequences that were
identical or more than 50% homologous to their
training data. Group B contained the rest of the
sequences. There were 64 sequences in group A and
49 sequences in group B. Apparently group B
should be used as the test data to compare the
GOR-Combined against other algorithms, because a
prediction algorithm could easily have a very high
prediction accuracy on protein sequences that are
either identical or highly homologous to its training
data, which cannot be used as an objective assessment
of the algorithm's prediction accuracy. For
group B the GOR-Combined was 6*4% accurate,
which is 3% lower than their reported result. One
reason for this might be that the GOR-Combined algorithm
used certain rules to combine the outputs of
different methods, and those rules did not work
quite as well for proteins not in its database. We
used the 64 protein sequences in group A to train
our hybrid system and applied it to the 49 protein
sequences in group B. It was 65.3% accurate. This



Table 10 

Accuracies of different algorithms for three-state
(helix, sheet, coil) prediction

Method

Lim (1974)
Chou & Fasman (1974)
Levin et al. (1986)
GOR III
Qian & Sejnowski (1988)
Holley & Karplus (1989)
Hybrid
tii-i
03

is about 1% lower than the average accuracy of the
hybrid system in the k-way cross-validation experiment.
We believe this was due to the smaller
training set used here, which had only 64
sequences.

Table 10 lists the results of several other algorithms.
The results were obtained from each
author's original report except those by Lim (1974)
and Chou & Fasman (1974), because in their original
reports they used the same data set for both
training and testing. Kabsch & Sander (1983)
assessed the accuracies of these two algorithms with
separate test data, and the results were included in
the Table instead. Among these, our hybrid system
was tested with the largest set of protein data and it
gave the highest prediction accuracy.



4. Discussion 

The idea of combining the strength of different
methods is not entirely new in either machine
learning research (Wolpert, 1990) or protein
secondary structure prediction. For example, Biou
et al. (1988) used certain rules to combine three
methods. However, the authors did not explain how
their rules were generated in the first place. Thus it
is difficult for us to justify the use of those rules. In
our hybrid system, the Combiner learns how to
combine the outputs of different experts automatically
from the training data. A novel procedure had
to be developed to train the Combiner because
different experts can have very different behaviors.
For example, after training, some experts can be
100% correct on the training data set while others
may be only 70% correct on the training data even
though they have very similar prediction accuracies
for proteins not in the training set. Our training
procedure for the Combiner can cope with experts
that have such different characteristics.

This work showed that although different algorithms
may have very similar overall secondary
structure prediction accuracies, their detailed
predictions can be different. No single algorithm
always gives a better prediction than others.
A combination of them can produce a statistically
significant improvement over each individual
method. We developed a way to train a Combiner,
which learned to combine the outputs of different
experts automatically. A neural network was used
as the Combiner in this work. But it is not the only
choice. An MBR system, for example, can also be
used as a Combiner. This paper is the first place
where the SM algorithm and the particular MBR
distance function have been introduced. Their accuracies
were as good as or even better than any other
single algorithm reported to date for secondary
structure prediction. They deserve a more detailed
discussion, which is beyond the scope of this paper
and is done elsewhere (X. Zhang, unpublished
results). The techniques we used to control the
training of artificial neural networks were not only
objective but also effective. For a single one-hidden-layer
network, the accuracy was 63.1% with our
techniques (to control training purely based on the
training data), whereas the other approach, to
monitor the performance of the network on the test
data during training, gave 63.5%. The difference
between them was only 0.4%. Thus our techniques
produced near-optimal training.

One of the reviewers of this paper raised the issue
of whether residues assigned to state G by the DSSP
program (Kabsch & Sander, 1983a) should be considered
as in helix, especially when they are
adjacent to state H. In our original experiments, we
wanted to make our result directly comparable with
results obtained by other researchers, such as Qian
& Sejnowski (1988), since the main point of this
paper is that for the same secondary structure
assignment, the hybrid system gives better prediction
than other algorithms. Thus we used the same
assignment as Qian & Sejnowski (1988), i.e. only
considering H for α helix and E for β strand. After
we received the reviewer's comments, we did the
following experiment: we assigned G states to be
helix if they are adjacent to H, otherwise assigning
them to be coil. This way, among the 19,861
residues in our database, 162 residues (0.8% of the
total residues) were assigned differently, i.e. to helix
instead of coil. Then we compared the original
prediction of our hybrid system with this new
assignment. It is 66.1% accurate. This is very close
to the original accuracy of 66.4%. The change in
accuracy (0.3%) is much smaller than the change in
the assignment (0.8%). This means that even
though the hybrid system was trained with a
different assignment, it can still predict correctly
most of the new assignment. This is in accordance
with observations by other researchers (e.g.
Richardson & Richardson, 1988) that there are
certain ambiguities on secondary structure
boundaries assigned by DSSP.
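
The reassignment experiment can be sketched as a mapping from a DSSP state string to three states. One interpretation is assumed here: a whole run of G residues is turned into helix when either end of the run touches an H residue. Whether the original experiment worked per run or per residue is not stated, and the function name is hypothetical.

```python
import re


def three_state(dssp: str, g_rule: bool = True) -> str:
    """Reduce a DSSP string to H (helix), E (strand) and C (coil).

    With g_rule=True, a run of G adjacent to an H becomes helix;
    every other G, and every other DSSP code, becomes coil.
    """
    def replace_run(match: "re.Match[str]") -> str:
        start, end = match.span()
        touches_h = (start > 0 and dssp[start - 1] == "H") or \
                    (end < len(dssp) and dssp[end] == "H")
        return ("H" if g_rule and touches_h else "C") * (end - start)

    with_g = re.sub(r"G+", replace_run, dssp)
    return "".join(s if s in "HE" else "C" for s in with_g)


original = three_state("HHHGGGTEEGGG", g_rule=False)
modified = three_state("HHHGGGTEEGGG", g_rule=True)
changed = sum(a != b for a, b in zip(original, modified))
print(original, modified, changed)
```

Counting the positions where the two assignments differ gives the number of residues reassigned from coil to helix.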

Good criteria for evaluating and comparing
different prediction algorithms are crucial for the
progress of this research field. In this work, we made
use of the significance interval measure from statistics,
which could tell us whether the differences
observed are significant or not, and what factors can
influence that. We emphasize the importance of the
fact that in our tests, the hybrid system never
looked at the test data during training, thus making
the performance of the system on the test data as
objective as possible. The k-way cross-validation
allowed us to test our hybrid system with as many
data as we have, and yet still avoided overlapping
between the test data and training data. Some
researchers have used one protein in each test
group, thus maximizing the training data size.
However, the extremely large amount of computation
in our work prevented us from doing that (i.e.
k = 113, the total number of protein sequences of
our database). We chose k = 8, which did not
reduce the size of each training data set very much,
and yet cut the amount of computation dramatically.
Even so, a large amount of computation was
still needed to carry out our experiment. This
involved (1) computing many statistics for SM and
distance matrices for MBR; (2) pattern matching
and sorting through the whole database to find
neighbors in MBR; and (3) training many neural
networks with large numbers of input/output
examples. The experiment was done on a massively
parallel computer, the Connection Machine CM-2. The
particular machine we used had 4096 processors. In
general, a CM-2 can have up to 65,536 processors.

There are many important issues in protein
secondary structure prediction, such as: (1) Is "the
percentage of correctly predicted residues" the best
measure for success? (2) What is the best way to
assign the secondary structures to a protein once
its three-dimensional co-ordinates are known?
(3) What are the right criteria for homology in
selecting test/training data? A comprehensive
discussion of these issues is beyond the scope of this
paper. The emphasis here is to demonstrate that our
hybrid system gives significantly better performance
than individual algorithms and all previous
methods, using the same criteria in selecting data
and the same accuracy measure as used by other
researchers.

ABSTRACT A rigid domain, defined here
as a tertiary structure common to two or more
different protein conformations, can be identified
numerically from atomic coordinates by
finding sets of residues, one in each conformation,
such that the distance between any two
residues within the set belonging to one conformation
is the same as the distance between the
two structurally equivalent residues within the
set belonging to any other conformation. The
distance between two residues is taken to be
the distance between their respective α carbon
atoms. With the methods of this paper we have
found in the deoxy and oxy conformations of
the human hemoglobin α1β1 dimer a rigid domain
closely related to that previously identified
by Baldwin and Chothia (J. Mol. Biol. 129:
176-220, 1979). We provide two algorithms, both
using the difference-distance matrix, with
which to search for rigid domains directly from
atomic coordinates. The first finds all rigid domains
in a protein but has storage and processing
demands that become prohibitively large
with increasing protein size. The second, although
not necessarily finding every rigid domain,
is computationally tractable for proteins
of any size. Because of its efficiency we are able
to search protein conformations recursively for
groups of non-intersecting domains. Different
protein conformations, when aligned by superimposing
their respective domain structures,
can be examined for structural differences in
regions complementing a rigid domain.

© 1995 Wiley-Liss, Inc. 

Keywords: difference-distance matrix, hemo- 
globin rigid core, structure search 

INTRODUCTION 

Structural domains in proteins have been defined
in numerous ways, among the better known being
visually recognizable conformational regions 1 and
sets of proximate residues within difference maps. 2
More quantitative definitions include clustering, 3
use of cutting planes, 4 minimization of interfacial
surface area, 5 maximization of solvent exclusion, 6
minimization of specific volume, 7 isolation of coherent
regions from normal mode analysis, 8 and maximization
of compactness. 9

In this paper tertiary structures existing in different
protein conformations define a rigid domain if
the distance between any two residues of the rigid
domain structure in one conformation is the same as
the distance between the two equivalent residues of
the rigid domain structure in every other conformation.
The distance between two residues is defined to
be the distance between their respective α carbon
atoms, which can be found from atomic coordinates.
The tertiary structures defining a rigid domain in
different protein conformations are geometrically
congruent and can be superimposed by aligning
their equivalent residues. The residues of a domain
(we shall refer to a rigid domain simply as a domain)
do not have to be sequentially or spatially contiguous.
The conformations being searched for domains
must have their primary sequences at least partially
aligned prior to implementing the algorithms of this
paper; for this reason our methods are easily applied
to the T and R states of an allosteric protein.
No assertions are made about the persistence of
structural rigidity of a domain along transitional
pathways between conformations.

Figure 1 illustrates the concept of a domain in an
eight-residue peptide with two conformations, A and
B. The heavy line connecting successive α carbon
atoms represents the peptide backbone. Distances
between all pairs of α carbon atoms are shown by
dashed lines in conformation A. Five residues form a
domain within the peptide, as shown by the dashed
lines in conformation B, which indicate that the distances
between all pairs of residues in the domain
are the same in both conformations. None of the
other three residues can be included in the domain
because its distance from at least one of the five
residues of the domain is not the same in conformation
A as in conformation B.

Fig. 1. A conformational change in an eight-residue
peptide, shown in conformations A and B. The heavy
line is the α-carbon backbone of the peptide.

We would like to have tertiary structures that are
nearly but not exactly congruent to each other
nevertheless to define a domain. For these structures,
distances between equivalent residues will differ
among conformations. Differences in distances can
arise from insignificant dissimilarities between
structures defining a domain or from experimental
uncertainty in coordinates. To allow us to include
geometrically incongruent structures as domains, we
generalize our definition of a rigid domain by specifying
a parameter ε so that the distance between
two residues of a domain in one conformation can
differ from the distance between the structurally
equivalent residues of the domain in another conformation
by as much as ε. The number of residues
included in a domain then depends on the value chosen
for ε. Domains found with small values of ε reveal
more detailed differences in structure between
conformations, while domains found with larger values
of ε identify gross similarities among conformations.
When searching a group of protein conformations
for domains, a good initial choice for ε is the
precision measure of the atomic coordinates. The
efficacy of the methods of this paper for identifying
structural similarities in protein conformations is
due in part to their not relying upon a least-square
measure of similarity to identify domains but only
upon the maximum absolute deviation in inter-residue
distance as given by ε.

Two distinct domains may have residues in common
or be entirely disjoint. The minimum number of
residues in a domain can be as small as two and still
be consistent with our definition, but the maximum
number of residues in a domain is limited only by
the number of residues in the protein. The hemoglobin
molecule can serve to illustrate these points.
Hemoglobin consists of two monomers, termed α and β.
Two copies of each monomer associate to form the
native tetramer. The hemoglobin structure has been
solved by X-ray crystallography in both oxy and
deoxy forms. Coordinates for human deoxyhemoglobin 10
and human oxyhemoglobin 11 were obtained
from the Protein Data Bank 12 as entries 2HHB and
1HHO, respectively. The search of residues 1-42 of
the α1 monomer (the N-terminus, A, B, and C helices)
for domains whose inter-residue distances differ
by no more than 0.30 Å between deoxy and oxy
conformations finds 643 domains with two residues
each. The five largest domains have 21 residues
each and have 18 residues in common. These results,
appearing in Figure 2, will be discussed further in
following sections.

   J         N(J)            C(R,J)
   2          643               861
   3         5341             11480
   4        29121            111930
   5       114643            850668
   6       343572           5245786
   7       808298          26978328
   8      1520258         118030185
   9      2311635         445891810
  10      2861660        1471442973
  11      2895704        4280561376
  12      2398436       11058116888
  13      1623937       25518731280
  14       894886       52860229080
  15       398055       98672427616
  16       141005      166509721602
  17        38937      254661927156
  18         8098      353697121050
  19         1196      446775310800
  20          112      513791607420
  21            5      538257874440

Largest rigid domains:

Residues common to all of the largest rigid domains:
{7 8 20 21 23 24 25 27 28 29 30 33 36 37 38 39 41 42}

Fig. 2. An exhaustive determination of domains within the
first 42 residues of the α1 monomer of human hemoglobin.

Different protein conformations can be aligned by
superimposing a common domain. A measure of how
well domain structures align is the root-mean-square
(RMS) fit of superimposed domain residues:

\[
\mathrm{RMS} = \left[ \frac{1}{R} \sum_{i=1}^{R} \left( \Delta x_i^2 + \Delta y_i^2 + \Delta z_i^2 \right) \right]^{1/2}
\]

where Δx² + Δy² + Δz² is the squared distance between
corresponding residues in two different superimposed
domain structures of R residues each. RMS
comparisons of entire conformations tend to conceal
small differences in structure, differences that are
readily apparent when domain structures found
with a sufficiently small ε are superimposed. Many
others have addressed aspects of structural alignment,
among them Vriend and Sander 13 and Holm
and Sander. 14
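
The RMS fit defined above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `rms_fit` and the representation of α carbon positions as (x, y, z) tuples are assumptions, and the two coordinate lists are assumed to be already superimposed and listed in the same residue order.

```python
import math


def rms_fit(coords_a, coords_b):
    """RMS deviation between two already-superimposed lists of (x, y, z) points.

    coords_a[i] and coords_b[i] are assumed to be the alpha-carbon positions
    of equivalent residues in the two conformations (hypothetical layout).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between corresponding residues.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

The function does not perform the superposition itself; it only scores coordinates that have already been aligned.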

Within the hemoglobin tetramer, the identical
α1β1 and α2β2 dimers undergo a quaternary reorientation
relative to one another with the change
from the deoxy to the oxy conformation. 15 Baldwin
and Chothia 16 found a group of residues within the
α1β1 dimer interface whose α carbon atoms remain
rigid during this allosteric transition. With our
methods we find such a set of α carbon atoms as well
and refer to it as the rigid core of the dimer. The
dimer core can be used to align the oxy and deoxy
structures to reveal conformational changes with
ligand binding.

METHODOLOGY 
Calculating the Difference-Distance Matrix 

The initial computational step for finding domains
from atomic data is to construct distance matrices
and to use them to find the difference-distance
matrix. For a given conformation, elements of
a distance matrix D_ij are the distances between α
carbon atoms i and j. With D_ij^(1) the distance matrix
for one conformation and D_ij^(2) the distance matrix
for another conformation, the difference-distance
matrix 17-19 is given by the absolute value of the
matrix difference,

\[
\Delta_{ij} = \left| D_{ij}^{(1)} - D_{ij}^{(2)} \right| . \tag{1}
\]

For computational efficiency, a matrix δ_ij is defined
such that

\[
\delta_{ij} =
\begin{cases}
0, & \Delta_{ij} > \varepsilon \\
1, & \text{otherwise.}
\end{cases}
\tag{2}
\]

Residue pairs i, j with a change of inter-residue distance
(between α carbon atoms) of no more than ε Å
have matrix elements δ_ij = 1; otherwise δ_ij = 0. The
value chosen for ε will depend on the purpose of the
calculation, being small if we seek subtle differences
between conformations or large if we are searching
for gross similarities among conformations.
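
Equations (1) and (2) can be illustrated with a short sketch. The function names and the list-of-tuples coordinate layout are assumptions made for illustration; the two conformations are assumed to list equivalent α carbons in the same order.

```python
import math


def distance_matrix(ca_coords):
    """D_ij: distances between alpha-carbon atoms i and j of one conformation."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


def rigidity_matrix(ca_coords_1, ca_coords_2, eps):
    """delta_ij of Eq. (2): 1 if |D_ij(1) - D_ij(2)| <= eps, else 0."""
    d1 = distance_matrix(ca_coords_1)
    d2 = distance_matrix(ca_coords_2)
    n = len(d1)
    return [[1 if abs(d1[i][j] - d2[i][j]) <= eps else 0 for j in range(n)]
            for i in range(n)]
```

For example, with a three-residue toy peptide whose third residue moves between conformations, only the pair formed by the first and third residues exceeds a tolerance of 0.3 and is marked non-rigid.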

Exhaustive Search For Domains 

An exhaustive search for domains within a polypeptide
constructs all rigid residue pairs and from
this set finds rigid triples, quadruples, and so forth,
until all combinatorial possibilities have been considered.
That is, an exhaustive search begins with
the set of all pairs of residues i, j for which δ_ij = 1
and from this set finds the set of all distinct rigid
triplets of residues i, j, k. A triplet is rigid if δ_ij = 1
for i and j any of the residues in the triplet. The
search is iteratively enlarged by finding every domain
with J + 1 residues from each domain with J
residues until all possible combinations of residues
have been exhausted. Figure 2 illustrates this
method. The number of residues in a domain is J,
while N(J) is the number of distinct domains with J
residues. The binomial coefficient C(R,J) is the
number of possible subsets of J residues, rigid or
otherwise, that can be found in a set of R residues.
The number of domains N(J) for each J can be fairly
large, although it is usually much smaller than the
number of possible subsets C(R,J). From Figure 2,
for example, 4,280,561,376 subsets of 11 residues exist
within a set of 42 residues, but only 2,895,704
domains of 11 residues each can be identified when
ε = 0.30 Å in the first 42 residues (the N-terminus,
A, B, and C helices) of the α1 monomer of human
hemoglobin. Figure 2 is further examined in the
Discussion section. Searching for domains in this
way is computationally demanding because the
number of subsets in a set of R residues increases
rapidly with R. We need to look for a faster way to
find domains.
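
The level-by-level enumeration described above might be sketched as follows. This is one possible reading, not the authors' implementation: `exhaustive_domains` is a hypothetical name, `delta` is assumed to be the symmetric 0/1 matrix of Eq. (2) with zero-based residue indices, and each domain is extended only with higher-numbered residues so that no domain is generated twice.

```python
def exhaustive_domains(delta):
    """Return {J: [domains with J residues]}, each domain a sorted tuple.

    delta is a symmetric 0/1 matrix; delta[i][j] == 1 marks a rigid pair.
    """
    n = len(delta)
    # Level J = 2: all rigid pairs.
    level = [(i, j) for i in range(n) for j in range(i + 1, n)
             if delta[i][j] == 1]
    domains = {}
    size = 2
    while level:
        domains[size] = level
        next_level = []
        for dom in level:
            # Extend only with residues beyond the last one to avoid duplicates.
            for k in range(dom[-1] + 1, n):
                if all(delta[k][m] == 1 for m in dom):
                    next_level.append(dom + (k,))
        level = next_level
        size += 1
    return domains
```

The length of each list in the returned dictionary corresponds to N(J); storing every level in memory is what makes this approach expensive for large R.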

An Incomplete but Fast Search for Domains 

We now introduce an alternative method for finding
domains that is computationally feasible for a
polypeptide of arbitrary length. With this method,
only a single domain is identified for J residues
(rather than the N(J) domains required for an
exhaustive search), with J near the maximal domain
size. The complement of this domain is then
searched exhaustively to identify all larger domains
that include it as a subdomain. The method saves
considerable computational effort and will still find
domains suitable for conformational comparison.

We assert that a residue i differing by a small
amount in relative position from one conformation
to another will have many of its δ_ij = 1, while a
residue differing by a large amount in relative position
will have many of its δ_ij = 0. To quantify the
changes in residue position, sums S_i of δ_ij are evaluated
for each residue i over all other residues j in
the polypeptide:

\[
S_i = \sum_{j \ne i} \delta_{ij} . \tag{3}
\]

The residues for which position differs the least between
conformations tend to have the largest S_i,
while those for which position differs the most between
conformations tend to have the smallest S_i.

The search for a domain is initiated by choosing
an integer N_s and finding all those residues i for
which S_i ≥ N_s. The set of residues for which this condition
is true is defined as U_ε(N_s). U_ε(N_s) will be the
entire protein when N_s is zero and will have only the
more rigid residues as N_s approaches the number of
residues in the protein. U_ε(N_s) is usually not itself a
domain, because distance matrix elements between
residues in U_ε(N_s) for one conformation can differ by
more than ε from those for other conformations.
However, the following strategy will find at least
one domain in U_ε(N_s), if any exist.

For each residue i in U_ε(N_s) find all other residues
j in U_ε(N_s) for which δ_ij = 0, that is, such that the
residue pair i, j is not rigid. The residue i that has
the largest number N_M of residues j for which δ_ij =
0 is removed from U_ε(N_s), leaving a subset of U_ε(N_s)
with one less residue. [The subsets of U_ε(N_s) will be
affected by the order in which residues with the
same value for N_M are deleted.] The reduction is
repeated until N_M ≤ 1. This subset of U_ε(N_s) is a
domain except for possible non-rigid pairs of residues.
A domain D_ε(N_s) can then be constructed from
this subset by removing all non-rigid pairs.

D_ε(N_s) can be enlarged by searching its complement
exhaustively to find all domains that preserve
D_ε(N_s) as a subdomain. Among these will be domains
found by adding back some residues that were
previously removed as non-rigid pairs in U_ε(N_s), but
other domains are often discovered as well.

Constructing a domain of J residues using the
method described in this section is much faster than
finding one by exhaustive enumeration of all N(K)
possibilities, as K grows from 2 to J. By choosing N_s
appropriately, the domains found by enlarging
D_ε(N_s) will generally be maximal or, if not, will have
residues in common with the maximal domains. The
algorithm to find D_ε(N_s) is most efficient for values
of N_s near R, the number of residues in the protein,
because construction of the domain D_ε(N_s) is computationally
fast, and the exhaustive search through
the complement of D_ε(N_s) will not have to find many
larger domains.

A domain can be used to align protein conformations
by translating the centroid of the domain for
each conformation to the coordinate origin and rotating
the domain of one conformation onto that of
the others with methods originally described by
Kabsch. 20,21 The resulting transformed coordinates
give a least-square fit between the domains of the
different conformations. The entire protein can now
be visually or numerically investigated for conformational
differences in other regions.

A summary of the above algorithm follows.

I. Read the coordinates of all residues i for each
   conformation.

II. Construct the distance and difference-distance
   matrices.

   A. Choose ε. [See the remarks about choosing ε
      after Eq. (2) above.]

   B. Calculate the difference-distance matrix Δ_ij
      for all pairs of residues.

   C. If Δ_ij > ε then δ_ij = 0; otherwise δ_ij = 1.

III. Find a domain (not necessarily the largest).

   A. Choose N_s.

   B. Calculate S_i for each residue i.

   C. For each i, if S_i ≥ N_s then include residue i in
      the set U_ε(N_s).

   D. For each i in the set U_ε(N_s), find all residues
      j also in U_ε(N_s) for which δ_ij = 0.

   E. Remove from U_ε(N_s) that residue i that has
      the most other residues j for which δ_ij = 0.

   F. When for every residue i remaining in U_ε(N_s)
      at most only one other residue j can be found
      for which δ_ij = 0, remove both i and j from
      U_ε(N_s) to give D_ε(N_s). Otherwise repeat III.D.
      and III.E.

IV. Search for larger domains.

   A. Examine each residue j in the complement of
      D_ε(N_s) to see if δ_ij = 1 for all residues i in
      D_ε(N_s).

   B. For each j for which all δ_ij = 1 in IV.A., include
      j in D_ε(N_s) to form a domain one residue
      larger.

   C. Repeat IV.A. and IV.B. with each such domain
      until no larger domains can be found.
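
Steps III and IV of the summary might be sketched as below. This is an illustrative reading under stated assumptions, not the authors' program: the names `fast_domain` and `enlarge` are hypothetical, `delta` is the symmetric 0/1 matrix of step II.C with zero-based residue indices, ties in step III.E are broken by lowest index (the text notes that tie order affects the result), and `enlarge` returns only the largest domains that contain the starting domain.

```python
def fast_domain(delta, n_s):
    """Steps III.B-III.F: build one domain D(N_s) from the 0/1 matrix delta."""
    n = len(delta)
    # III.B: S_i counts the other residues j that are rigid with i.
    sums = [sum(delta[i][j] for j in range(n) if j != i) for i in range(n)]
    # III.C: keep residues whose sum reaches the threshold N_s.
    subset = [i for i in range(n) if sums[i] >= n_s]
    while True:
        # III.D: number of non-rigid partners of each residue inside the subset.
        counts = {i: sum(1 for j in subset if j != i and delta[i][j] == 0)
                  for i in subset}
        if not counts or max(counts.values()) <= 1:
            break
        # III.E: drop the residue with the most non-rigid partners.
        worst = max(subset, key=lambda i: counts[i])
        subset.remove(worst)
    # III.F: remove both members of every remaining non-rigid pair.
    return [i for i in subset if counts[i] == 0]


def enlarge(delta, domain):
    """Step IV: largest domains that keep `domain` as a subdomain."""
    n = len(delta)
    base = tuple(sorted(domain))
    # IV.A: residues of the complement that are rigid with the whole domain.
    candidates = [j for j in range(n)
                  if j not in base and all(delta[i][j] == 1 for i in base)]
    level = [()]
    while True:
        next_level = []
        for extra in level:
            start = candidates.index(extra[-1]) + 1 if extra else 0
            for k in candidates[start:]:
                # IV.B: k must also be rigid with residues already added.
                if all(delta[k][m] == 1 for m in extra):
                    next_level.append(extra + (k,))
        if not next_level:
            break
        level = next_level
    return [tuple(sorted(base + extra)) for extra in level]
```

On a toy matrix with one strongly non-rigid residue and one remaining non-rigid pair, `fast_domain` discards the former first and then both members of the pair, and `enlarge` adds back one member of the pair at a time, in the manner of the worked hemoglobin example in the Discussion.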

DISCUSSION 

We illustrate the methods defined above by searching
for domains in the first R = 42 residues of the α1
monomer of human hemoglobin (the N-terminus, A,
B, and C helices). (The reader will please note that we
are using hemoglobin as a convenient example for
the application of these methods; we make no pretense
to a thorough study of this protein in this paper.)
As previously, J is the number of residues in a
domain, N(J) is the number of domains with J residues,
and C(R,J) is the number of possible subsets of
J residues in a set of R residues. Figure 2 outlines an
exhaustive search for domains within this peptide
when ε = 0.30 Å. The number of rigid residue pairs
is 643 while the number of possible pairs of residues
is 861. Similarly, the number of possible triplets is
11,480, but only 5,341 of them are rigid. The number
of possible sets C(R,J) grows combinatorially with J,
while the number of domains eventually converges to
5 when J is 21. The 18 residues common to all five
largest domains are listed at the bottom of the figure.
The rigidity of these five largest domains is assessed
by superimposing the deoxy and oxy structures, 20,21
with RMS fits as shown in the Figure. For conformation
alignment all the largest domains are effectively
the same, as can be seen in Figure 3.
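
The counts of possible subsets quoted above are binomial coefficients and can be checked directly. The sketch below uses Python's `math.comb`; the helper names are illustrative only.

```python
import math


def possible_subsets(r, j):
    """C(R, J): number of subsets of J residues in a set of R residues."""
    return math.comb(r, j)


def cumulative_subsets(r, j_max):
    """Number of subsets with 2 through j_max residues in a set of r residues."""
    return sum(math.comb(r, j) for j in range(2, j_max + 1))
```

With R = 42, `possible_subsets(42, 2)` gives the 861 possible pairs and `possible_subsets(42, 3)` the 11,480 possible triplets cited in the text.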

The computational demands of an exhaustive
search for domains are apparent in Figure 2. The
number of sets with J or fewer residues that could be
rigid is the sum of all the binomial coefficients C(R,J)
from 2 through J, a number that grows exponentially
with J. The fast search avoids such an encumbrance
by finding only one of the N(J) domains, the domain
D_0.30(N_s), and exhaustively enlarging only this one
domain. With a value N_s = 25, the search of the
difference-distance matrix results in a set U_0.30(25)
of 35 residues:

U_0.30(25) =

{1 2 3 6 7 8 9 10 13 14 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42}.

The residue belonging to the most non-rigid residue
pairs is residue 14, which is not rigid with N_M = 12
other residues in U_0.30(25) (see table below). After
deleting residue 14 from U_0.30(25), the subsequent
search for the residue with the largest sum N_M finds
residue 22, with ten non-rigid pairs. Iteration until
N_M is not larger than 1 results in the removal of 13
residues from U_0.30(25), which leaves a set of 22 residues,
the last one removed being residue 40. The
following summarizes the deletion of residues from
U_0.30(25):

   J    EXCLUDED RESIDUE    N_M
  35           14            12
  34           22            10
  33           34             8
  32           26             7
  31           23             6
  30           10             6
  29           21             5
  28            7             5
  27           25             4
  26           29             3
  25            8             3
  24           42             2
  23           40             2

A set of 22 residues with two non-rigid residue pairs,
(19,41) and (32,35), remains after the removal of residue
40. Deleting both non-rigid residue pairs leaves
a domain D_0.30(25) with 18 residues:

D_0.30(25) = {1 2 3 6 9 13 18 20 24 27 28 30 31 33
36 37 38 39}.

D_0.30(25) is only one of the 8,098 domains with 18
residues found exhaustively in Figure 2. An exhaustive
search through the complement of D_0.30(25)
finds four domains with 20 residues each, the two
additional residues being one from each of the non-rigid
residue pairs (19,41) and (32,35):

{1 2 3 6 9 13 18 19 20 24 27 28 30 31 32 33 36 37 38 39}   RMS = 0.191 Å
{1 2 3 6 9 13 18 19 20 24 27 28 30 31 33 35 36 37 38 39}   RMS = 0.186 Å
{1 2 3 6 9 13 18 20 24 27 28 30 31 32 33 36 37 38 39 41}   RMS = 0.213 Å
{1 2 3 6 9 13 18 20 24 27 28 30 31 33 35 36 37 38 39 41}   RMS = 0.205 Å

The RMS value for the superposition of each oxy
domain upon its deoxy counterpart is listed after
each domain. No other residues could be found in the
complement of D_0.30(25) that would fit rigidly in any
of the four domains listed above.

We now compare the above with the exhaustive
search of Figure 2, which revealed 112 domains of 20
residues each but only 5 domains of 21 residues
each. The first two domains above are actually more
rigid, in the sense of smaller RMS, than any of the
five domains with 21 residues found exhaustively in
Figure 2. The four domains found above also intersect
extensively with the 5 of 21. Of the 18 residues
common to the five domains with 21 residues, 10
also occur in the above four domains of 20 residues.
Thus the four domains found above are close approximations
to the five largest domains found exhaustively
in the structure, which themselves, because of
their extensive overlap, represent what is essentially
one domain. As a matter of interest, the RMS
fit for the entire peptide using the centroid of all 42
residues is 0.324 Å.

Fig. 3. The superposition of the oxy (dashed) upon the deoxy
conformation of the A through C helices of the α1 monomer of
hemoglobin. A: Superposition using a domain found from
D_0.30(25) with an RMS fit of 0.186 Å. B: Superposition using a
domain found from D_0.30(36) with an RMS fit of 0.213 Å. The
N-terminus is at the lower right and C-terminus of the C helix is at
the upper left for both A and B views of the superimposed peptide.
The graphical superpositions illustrate the differences between
these two domains but show their equivalence for structural
comparison.

We resume our search of the first 42 residues of
the α1 monomer of human hemoglobin to see the
dependence of both U_ε(N_s) and D_ε(N_s) upon N_s. Repeated
trials show that 36 is the smallest value for
N_s that leads to a set U_0.30(N_s) of residues common
to all the largest domains of the exhaustive search of
Figure 2. Because U_0.30(36) is a domain and no residues
need to be removed from it, U_0.30(36) and
D_0.30(36) are identical.

U_0.30(36) = D_0.30(36) = {20 27 29 30 33 41}.

The complement of D_0.30(36) is now searched for residues
rigid with those already in D_0.30(36), adding
such residues one at a time to D_0.30(36), just as in
Figure 2, but now with many fewer combinations to
examine. This gives the following table, in which
the first column is the size of a domain and the second
is the number of domains of that size for which
D_0.30(36) is a subdomain:

   J      N(J)
   6         1
   7        25
   8       245
   9      1306
  10      4478
  11     10816
  12     19290
  13     26008
  14     26827
  15     21278
  16     12953
  17      5981
  18      2038
  19       485
  20        72
  21         5

No larger domains exist to 0.30 Å. The five domains
of 21 residues each in the last line of the table are
identical to those revealed by the exhaustive search
of Figure 2.

Thus we have obtained with much less computational
effort the results of the exhaustive search and
have also found other slightly smaller domains
which are actually more rigid. A sample of the results
is shown in Figure 3, where a selected two of
the domains have been used to align the first 42
residues of α1 human hemoglobin. The peptide is
aligned in (A) by superimposing the oxy and deoxy
forms of the 20 residue domain with an RMS fit of
0.186 Å derived from D_0.30(25), while the alignment
in (B) results from similarly superimposing the last
of the five domains listed at the bottom of Figure 2,
which has an RMS fit of 0.213 Å.

The various domains that we have found can be 
thought of as constituting a family of closely related 
domains that represent one rigid object with minor 
variations. As seen in Figure 3, two domains of this 
family are practically equivalent for alignment pur- 
poses. 

A Method for Finding Rigid Domains From a
Subdomain of D_ε(N_s)

Larger domains can be found from the domain
D_ε(N_s) by including with D_ε(N_s) all combinations of
residues within the complement of D_ε(N_s) that are
rigid with D_ε(N_s). All these larger domains contain
D_ε(N_s) as a subset. In place of D_ε(N_s), however, any
subset of D_ε(N_s) can be used as the domain common
to all subsequently larger domains, since any subset
of D_ε(N_s) is also a domain. By doing so we can find
domains that have specific characteristics we may
wish to retain. For example, residues that are either
spatially or sequentially separate from the rest of
the residues in D_ε(N_s) can be eliminated. The subsequent
search through the complement of this subdomain
of D_ε(N_s) will then lead to larger domains
within the protein that retain the desired residues of
the subdomain.

We show how this works with the example of the
previous section, the first 42 residues of the α1
monomer of human hemoglobin. D_0.30(25) includes
residues 1, 2, 3, 6, 9, 13, and 18, all of which are part
of the A helix or N-terminus. Removing these from
D_0.30(25) leaves a subdomain lying only within the B
and C helices:

D_0.30(25)mod = {20 24 27 28 30 31 33 36 37 38 39}.

A search through the complement of this subdomain
of D_0.30(25) finds three of the five largest domains
found with the exhaustive search of Figure 2. These
domains have more B helix (residues 20-35) and
less A helix (residues 3-18) than the four domains of
20 residues each first found above. The search is as
follows:

   J      N(J)
  11         1
  12        21
  13       162
  14       626
  15      1366
  16      1780
  17      1424
  18       709
  19       218
  20        39
  21         3

A modest computational effort can sample a family 
of domains in a polypeptide and thereby escape the 
exponentially increasing demands for processor 
time and storage space required by an exhaustive 
search. 
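
A table of N(J) values like those above can be produced by enlarging a seed subdomain one residue at a time and counting the domains found at each size. The sketch below is illustrative only: `count_domains_from_seed` is a hypothetical name, `delta` is assumed to be the symmetric 0/1 rigidity matrix with zero-based indices, and the seed is assumed to be a domain already.

```python
def count_domains_from_seed(delta, seed):
    """Return {J: N(J)} for all domains that contain `seed` as a subdomain."""
    n = len(delta)
    seed = tuple(sorted(seed))
    # Residues outside the seed that are rigid with every seed residue.
    candidates = [j for j in range(n)
                  if j not in seed and all(delta[i][j] == 1 for i in seed)]
    counts = {len(seed): 1}
    level = [()]
    while True:
        # Add one candidate at a time, in increasing order to avoid duplicates.
        next_level = [extra + (k,)
                      for extra in level
                      for k in candidates
                      if (not extra or k > extra[-1])
                      and all(delta[k][m] == 1 for m in extra)]
        if not next_level:
            break
        counts[len(seed) + len(next_level[0])] = len(next_level)
        level = next_level
    return counts
```

Because only residues rigid with the whole seed are ever considered, the number of combinations examined is far smaller than in a search that starts from all rigid pairs.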

RESULTS 

Rigid Core of the α1β1 Dimer of
Human Hemoglobin

The non-exhaustive method for finding domains is 
applied here to the entire human hemoglobin 
dimer. Exhaustively finding domains in a hemoglo- 
bin dimer would be unacceptably slow. The two con- 
formations of the dimer in which we shall look for 
domains are the deoxy conformation 2HHB 10 and 
the oxy conformation 1HHO, 11 both from the Protein
Data Bank. 12 A hemoglobin dimer is constructed
from two monomers, α1 and β1, with 141 and
146 residues, respectively.

We first search for domains in the hemoglobin
dimer. Distance matrices for both the deoxy and oxy
dimers have been computed, and a trial value of 200
is chosen for N_s. This is shown in Figure 4. [...]
in the set U_0.50(200) [...] is removed from [...]
The next least rigid residue is α1 72 [...] in this
manner until a subset of residues remains for which
each member is non-rigid with at most one other
residue in the subset. This subset of 86 residues
[...] Removing all the non-rigid pairs [...]
D_0.50(200), which has 74 residues [...]
D_0.75(215). One of these 16 domains is [...]
deoxy residues of the chosen representative [...]
the same structure.

Effect of Changing ε

Non-zero difference-distance matrix elements Δ_ij
defined by Eq. (1) owe [...] and j in conformation (1)
is [...] between the same residues in conformation (2)
is D_ij^(2), then

[...]   (4)

Matrix elements are then

[...]   (5)

We will assume no correlation between the Δ_ij and
will consider only those [...] of a family of rigid [...]
distance matrix for this set of N residues is now

[...]   (6)

[...] criterion is

[...]   (7)
(7) 
Residues excluded from U_0.50(200):

    J    EXCLUDED RESIDUE    N_M
  104        α1 11            24
  103        α1 72            20
  102        α1 4             19
  101        β1 20            16
  100        α1 16            11
   99        β1 55             8
   98        α1 12             8
   97        α1 15             7
   96        β1 108            6
   95        α1 115            5
   94        α1 106            5
   93        α1 71             5
   92        α1 [...]          5
   91        β1 38             4
   90        α1 105            4
   89        β1 133            3
   88        α1 118            3
   87        β1 82             2

Remaining non-rigid pairs (residue numbers):
(3, 8) (5, 109) (9, 51) (10, 115) (103, 120) (22, 130)

Domains containing D_0.50(200):

   J      N(J)
  75        20
  76       184
  77      1032
  78      3942
  79     10848
  80     22180
  81     34232
  82     40081
  83     35436
  84     23292
  85     11040
  86      3568
  87       704
  88        64

An 88 residue domain with RMS = 0.240 Å

Fig. 4. Search for a rigid core of the hemoglobin α1β1
dimer. The residues with a sum S_i of at least 200 form the
set U_0.50(200). Only 86 of these residues are non-rigid with
at most one other residue. D_0.50(200), a domain of 74
residues, is left after all 12 residues belonging to the
remaining non-rigid pairs are removed. Only one of the
largest domains is shown at the bottom of the figure, along
with the RMS value of its oxy-deoxy superposition.

Fig. 5. Superposition of the deoxy and oxy dimers of
human hemoglobin by aligning the residues of one of the rigid
cores found with ε = 0.75 Å. The view is down the x-axis toward
the dimer-dimer interface. The α1 monomer is to the lower left,
and the β1 is to the upper right. Blue and green, the α carbon
backbone of the deoxy conformation; red, the oxy conformation.
About half of the same rigid core was found with visual methods
by Baldwin and Chothia. 16 The RMS value for the oxy-deoxy
superposition of this rigid core is 0.333 Å. Only deoxy hemes,
colored brown, have been drawn in the figure. Both heme pockets
clearly undergo considerable conformational change relative to
the rigid core.



since there are M(M - 1)/2 pairs of the M residues,
all of which must have Δ less than ε. A total of
C(N,M) subsets of M residues exists in the set of N
residues, so the expected number of subsets of M
residues that define a domain is C(N,M) times Eq.
(7). We seek the largest value of M for a given ε for
which we find at least one domain within the original
set of N residues. Thus the largest M is the integer
closest to the solution of the equation

    C(N,M) \left[ \int_{\Delta < \varepsilon} W(\Delta)\, d\Delta \right]^{M(M-1)/2} = 1.    (8)

Actually, for the hemoglobin dimer, we find that the
data are fit much better if we assume that there are
two disjoint subsets, one with m_1 points described by
W_1(Δ) with standard deviation σ_1 and the other
with m_2 points described by W_2(Δ) with σ_2. Eq. (8)
then generalizes to

    C(N,M) \left[ \int_{\Delta < \varepsilon} W_1(\Delta)\, d\Delta \right]^{m_1(m_1-1)/2} \left[ \int_{\Delta < \varepsilon} W_2(\Delta)\, d\Delta \right]^{m_2(m_2-1)/2} = 1    (9)

with M = m_1 + m_2.
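As a rough numerical illustration of Eq. (8), the sketch below searches for the largest M at which the expected number of rigid subsets is still at least one. It rests on assumptions that are not stated in the text: W(Δ) is taken to be a zero-mean Gaussian with standard deviation σ, and a pair is counted as rigid when |Δ| < ε, so that the per-pair probability is erf(ε/(σ√2)). The function names are hypothetical, and the routine returns the largest integer M that satisfies the inequality rather than the integer closest to the real-valued solution.

```python
import math


def log_binomial(n, m):
    """Natural log of the binomial coefficient C(n, m)."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)


def pair_probability(eps, sigma):
    """Probability that one pair has |delta| < eps.

    Assumption: delta is zero-mean Gaussian with standard deviation sigma.
    """
    return math.erf(eps / (sigma * math.sqrt(2.0)))


def largest_domain_size(n, eps, sigma):
    """Largest M with C(n, M) * p**(M*(M-1)/2) >= 1, evaluated in log space."""
    p = pair_probability(eps, sigma)
    if p <= 0.0:
        return 1  # no pair can be rigid, so only single residues qualify
    best = 1
    for m in range(2, n + 1):
        log_expected = log_binomial(n, m) + 0.5 * m * (m - 1) * math.log(p)
        if log_expected >= 0.0:
            best = m
    return best
```

Working in log space avoids overflow in C(N,M) for chains of a few hundred residues. Extending the sketch to Eq. (9) would mean adding a second log-probability term with its own σ and exponent.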



[Figure 6: plot; horizontal axis: epsilon, Ang. Units]

Fig. 6. The dependence of hemoglobin core size on ε. Circles
mark the number of residues in a core domain for each ε as found
from PDB atomic coordinates. The solid line is a best-fit of Eq. (9)
with σ1 = 0.20 Å and σ2 = 0.86 Å to the measured points. We
have assumed in this calculation that difference-distance matrix
elements Δij have a Gaussian distribution.



A best-fit curve of M versus ε obtained from the
solution of Eq. (9) is shown in Figure 6. This curve
was obtained by varying the parameters σ1, σ2, and
m1 until the sum of the squares of the deviations from
the points was minimized. The values found for σ1
and σ2 were 0.20 Å and 0.86 Å, respectively, in the
neighborhood of the experimental precision of about
0.5 Å. (The standard deviation of this curve from the
points is 4.0, whereas the standard deviation found
when attempting to fit with only one Gaussian subset
was 25; the one-Gaussian fit was not satisfactory.)
Thus even this crude theory of the dependence of the
number of residues M in a domain on ε gives a
reasonably good description of the observations.

In Figure 7 we show the rigid domains found with
ε values of 0.25 Å (asterisks) and 0.50 Å (circles),
while Figure 5 shows the 0.75 Å domain. The core
structure appears to be well marked by the 0.50 Å
circles. Increasing ε from 0.50 Å to 0.75 Å mainly
picks up more residues in the same structure while
extending the structure only slightly. Apparently
the principal difference between the 0.50 and 0.75
cores is that the latter is more tolerant of errors in
the data. This is in accord with Baldwin and Chothia's 16
estimate that differences between coordinates
in their data were not significant unless they
exceeded about 0.50 Å because of experimental
uncertainty in the coordinates. This domain is definitely
though sparsely marked in Figure 7 even by the 0.25
Å asterisks. Thus the identification of the gross
structure of a rigid domain is not very sensitive to
the value of ε for sufficiently large ε.

CONCLUSIONS 

Proceeding from the premise that if rigid domains
exist they should be important components of protein
structure, we have devised two methods for
finding such domains, and have tried out these
methods using subunits of hemoglobin. We have
found that rigid domains occur in families, the members
of which overlap extensively and differ by only
a few residues. The concept of a family of overlapping
domains is an important generalization of the
rigid-domain concept itself.

Nearly all the residues belonging to the hemoglobin
dimer rigid core of Figure 5 are found within the
A, B, C, G, or H helices of the α1 and β1 monomers.
Similar structures have been noted before. Baldwin
and Chothia 16 identified 68 residues that form an
invariant set along the α1β1 interface of the hemoglobin
dimer. These residues are mostly the parts of
the α and β B, G, and H helices and were used as a
frame of reference by Baldwin and Chothia from
which to observe the tertiary and quaternary
changes in hemoglobin. Except for residues β1 30
and β1 31, which are within the interior of the β1 B
helix, and residue β1 54, a valine residue in the β1 D
helix, all are included in our family of 16 rigid core
domains with ε = 0.75 Å. Baldwin and Chothia
noted as well that the α B, α C, α G, and α H helices
and the β B, β D, β G, and β H helices together,
except for the first few residues of the G helices and
the last few residues of the H helices, remain fairly
invariant between the T and R states of hemoglobin.
For larger values of ε we find rigid core domains
that include most of these helices but also many
residues in both the α and β A helices and in the C helix
of β as well. That the A, G, and H helices form a
protected folding unit in apo-myoglobin has been
noted by Hughson et al. 22

The rigid core is not the only domain that can be 
found in the hemoglobin dimer. By removing the 
rigid core residues from the dimer structure and 
searching the remainder we can find several other 
smaller, independent domains associated with the 
heme molecules. We expect to describe these in an- 
other paper in preparation. 

The primary contribution of this paper is a
method to determine conserved spatial relationships.
As such, it is directly applicable to analysis of
complex conformational changes in proteins. Allostery
is one such case; there are others, such as the
calcium-triggered change in calmodulin or the
rearrangement of the hemagglutinin of influenza virus.

We have thought about the application to finding
conserved cores in homologous proteins. However,
that application requires substantial further
development. We can calculate conserved structure given
a sequence alignment, but finding the best sequence
alignment for identifying conservation of structure
is another problem. The discussion of sequence
alignment would take us far outside the scope of this
paper.
Fig. 7. A view down the x-axis of the hemoglobin dimer with
residues in a rigid core for ε = 0.25 Å marked with asterisks and
for ε = 0.50 Å marked with circles. Apparently, the effect of
increasing ε is to fill in secondary structures already defined by
smaller values of ε.



Detection of Common Three-Dimensional
Substructures in Proteins

ABSTRACT We present a fully automatic
algorithm for three-dimensional alignment of 
protein structures and for the detection of com- 
mon substructures and structural repeats. 
Given two proteins, the algorithm first identi- 
fies all pairs of structurally similar fragments 
and subsequently clusters into larger units 
pairs of fragments that are compatible in three 
dimensions. The detection of similar substruc- 
tures is independent of insertion/deletion pen- 
alties and can be chosen to be independent of 
the topology of loop connections and to allow 
for reversal of chain direction. Using distance 
geometry filters and other approximations, the 
algorithm, implemented in the WHAT IF pro- 
gram, is so fast that structural comparison of a 
single protein with the entire database of 
known protein structures can be performed 
routinely on a workstation. The method repro- 
duces known non-trivial superpositions such as 
plastocyanin on azurin. In addition, we report 
surprising structural similarity between ubi- 
quitin and a (2Fe-2S) ferredoxin. 

Keywords: protein structure comparison, su- 
perposition, clustering, folding 
units, sequence alignment 

INTRODUCTION 

Comparison of protein structures has many areas 
of application. Three-dimensional similarity can be 
used to produce protein alignments in cases where 
sequence similarity is so weak that sequence align- 
ment programs fail. 1 Structure-based sequence 
alignment can reveal evolutionary relationships 
and provide the basis for the construction of phylo- 
genetic trees. 2 Multiple alignment of structures nat- 
urally leads to the definition of a common structural 
core of a protein family, 3 to the identification of 
structurally important conserved contact regions, 4 
and to the detailed study of residue replacements in 
conserved structural context. 5 

The principal difficulty in comparing three- 
dimensional protein structures is that of identifying 
structurally equivalent residues. Once a list of 
equivalent residues is known, elegant solutions to 
the problem of optimal superposition in 3-D 6 can be 
used to produce explicit coordinates of one protein in 
the framework of the other. Superficially, the equiv- 



alencing problem is similar to the problem of one- 
dimensional alignment of amino acid sequences. 
There is, however, an added complication in that 
clusters of residues locally similar in three-dimen- 
sional space may involve chain regions separated by 
many residues, i.e., arranged non-locally in sequence
space. It is therefore not sufficient to com-
pare one-dimensional neighborhoods in sequence 
space, but also necessary to compare three-dimen- 
sional neighborhoods in real space. For this reason, 
one-pass dynamic programming algorithms are not 
suitable for this problem. 

Several authors have invented generalizations of 
sequence alignment algorithms in order to solve the 
3-D equivalencing problem. For example, Taylor 
and Orengo 7 first define a local measure of similar- 
ity between any two sequence positions in two pro- 
teins by aligning the contact environments of each 
residue in protein A with that of each residue in 
protein B, using a dynamic programming algorithm. 
Subsequently, they solve the one-dimensional align- 
ment problem in terms of new local similarities de- 
rived from the first step, again by dynamic program- 
ming. The algorithm can be thought of as solving 
the problem of aligning two contact maps (or dis- 
tance plots), allowing insertions and deletions but 
adhering strictly to the sequential order of residues 
along the chain. This method is conceptually neat
and works well, but it is costly in computer time, as
the algorithm is of order N(A)^2 N(B)^2, where N(X) is
the chain length of protein X. Sali and Blundell 8 use
a Monte Carlo method, simulated annealing, to deal 
with the complexity of optimizing structural super- 
position, whereas Zuker 9 uses a dynamic program- 
ming algorithm. 

Several other known methods for protein struc- 
ture comparison are not based on generalizations of 
sequence alignment algorithms, but use a variety of 
iterative schemes to optimize superposition. 10-19 
These methods have been extensively used for 
(closely) related structures. However, they each suf- 
fer from one or more of the following drawbacks: (1) 
large inserts or deletions are difficult to recognize;
(2) only sequential alignments can be detected;

(3) neither ^ «c = f loop 
tif nor spatial Bimdanty i P align . 

ison 1S proposed that overcome h rf 
ate a Agonal plot. 1 his »s a 

° n P r rS See final optimization of the set 
nates. 

METHODS 
From Sequence Alignment to 3-D Alignment 

The point of departure of our algorit 

■*f" or , del f"™; dfiical .equence alisnment 
tW ^ W rXfofTndtrSovLl optimal P»th 

maintained. Sequential order is maintained for two fragment
pairs, A1/B1 and A2/B2, if the fragments occur in the same
order in both proteins, i.e., either A1<A2 and B1<B2 or A1>A2
and B1>B2, but not in mixed order like A1<A2 and B1>B2 or
A1>A2 and B1<B2, where < means "comes before" in sequence
and > means "comes after." See Figure 1A for an example.
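The sequential-order condition stated here can be written as a one-line predicate. The sketch below is illustrative only; the function name and the convention of describing each fragment by its starting residue index are assumptions, not part of the published method.

```python
def same_sequential_order(a1, a2, b1, b2):
    """Return True if fragment pairs A1/B1 and A2/B2 keep sequential order.

    a1, a2: start positions of the two fragments in protein A.
    b1, b2: start positions of the equivalent fragments in protein B.
    Order is kept when A1 precedes A2 and B1 precedes B2, or when both
    relations are reversed; mixed cases break sequential order.
    """
    return (a1 < a2 and b1 < b2) or (a1 > a2 and b1 > b2)
```

A 3-D clustering that is allowed to ignore loop topology would simply skip this check, whereas a classical sequence-alignment path requires it for every pair of fragments.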

af in 3-D alignment, an optimal sum over fragmen 



duster is similar to that of Bi and the B Jhgjjto 

;=/or^=SS 

in SnTc C all y . similarity of spatial * ^ 

be evaluated either in terms of 

position or in ^%*SZ£T^ 
example one could simply a B1 + B2 

superposition of M + A2 as o p mean 

comparfa set of alpha-carbon a^Tfj^ 
■ + A2 with an equivalent set within Bl + B2. A 

the production of the diagonal plot. This is our ■ y 
tne piu" the sl mUarity 

technical point. The idea is w * 

given in the next three subsections. 



Diagonal Plot 

ThP first step in the creation of the diagonal plot s 

See-dimensional superposition » order to ehmi 
^SSSST-r fragment geometry in terms of 

JerLl alpha^n *^^*^t 
sim ilar to that of Jo nes ^ ^hirup ex^ ^ 
Fig. 1. Schematic example of comparing two alpha-helical protein
structures, A and B, in which helix A1 is similar to helix B1, A2
to B2 and A3 to B3. A. The cores of the two structures superpose
well, even though the interhelical connections are different. B.
Helices A1 + A2 superpose well on B1 + B2. The third fragment
pair, A3/B3, cannot join the cluster made by the other two helix
pairs because the third helix has a different orientation in the two
proteins.

ment are used. Using more than five distances did
not appreciably improve the selectiveness of the filter.
A pair of fragments is rejected if two equivalent
alpha-carbon distances differ by more than a specified
cutoff. To be sure that no pair will be rejected
spuriously, this cutoff should be set at two times the
maximum acceptable coordinate error after 3-D
superposition of the fragments. The length of the
shortest fragments compared is normally set at 10 to
15 residues. Using shorter fragments gives rise to an
excessive number of matched fragments; using a
longer minimum length tends to reduce the number
of hits unacceptably.

In a second step, pairs of fragments that are not
rejected by the distance geometry filter are superposed
using the least-squares algorithm of Kabsch. 6
This is straightforward as the two fragments in a
pair have the same length. For each fragment pair,
the goodness of fit is evaluated in terms of the root
mean square distance, Drms, and the largest distance,
Dmax, between equivalent alpha-carbon atoms,
after optimal superposition. Two fragments are
considered to be sufficiently similar if these distance
values are below specified upper limits, typically 2.0
Å for Drms and 3.8 Å for Dmax (tighter limits
should be used for very similar proteins).
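The second step can be sketched with NumPy as follows. This is a minimal illustration of a least-squares (Kabsch-style) superposition followed by the Drms and Dmax test, not the WHAT IF implementation; the function names `kabsch_fit` and `fragments_similar` are hypothetical, and the default limits are the typical values quoted in the text.

```python
import numpy as np


def kabsch_fit(frag_a, frag_b):
    """Superpose frag_a onto frag_b (equal-length n x 3 alpha-carbon arrays).

    Returns (rotation, drms, dmax), where rotation is the 3 x 3 matrix that
    maps centered frag_a coordinates onto centered frag_b coordinates.
    """
    a = np.asarray(frag_a, dtype=float)
    b = np.asarray(frag_b, dtype=float)
    a0 = a - a.mean(axis=0)            # remove translation
    b0 = b - b.mean(axis=0)
    h = a0.T @ b0                      # 3 x 3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    fitted = a0 @ rotation.T
    dist = np.linalg.norm(fitted - b0, axis=1)
    drms = float(np.sqrt(np.mean(dist ** 2)))
    dmax = float(dist.max())
    return rotation, drms, dmax


def fragments_similar(frag_a, frag_b, drms_max=2.0, dmax_max=3.8):
    """Accept a fragment pair if both deviations are within the limits (in Å)."""
    _, drms, dmax = kabsch_fit(frag_a, frag_b)
    return bool(drms <= drms_max and dmax <= dmax_max)
```

Returning the rotation matrix alongside the deviations is a convenience: the same matrix can later be reused when fragment pairs are compared with each other.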

In a third step, accepted fragment pairs are elon- 
gated: one residue is added at the C-terminus of both 
fragments and the longer fragments are again su- 
perposed. This process is repeated until the next ad- 
dition would lead to violation of the upper limits. 
Additional computer time is saved by avoiding the 
comparison of fragments that are entirely helical, as 
every helix always fits every other equally long or 
longer helix: fragments are only compared if they 
contain at least four non-helical residues. Secondary 
structure assignments are taken from the DSSP
dictionary. 21 Also, fragment pairs that would be
subsets of already stored fragments are skipped.
Together, these three empirically developed steps
produce very useful diagonal plots as input to the
cluster analysis, with great economy of computer time.
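The elongation loop of the third step might look like the sketch below. It is written against a caller-supplied `fit` function (hypothetical) that superposes the two fragments of a given length and returns `(drms, dmax)`; the name `elongate` and the index convention are likewise assumptions made for illustration.

```python
def elongate(start_a, start_b, length, len_a, len_b, fit,
             drms_max=2.0, dmax_max=3.8):
    """Extend an accepted fragment pair at the C-terminus, one residue at a time.

    start_a, start_b: first residue index of the fragment in each chain.
    length: current (already accepted) fragment length.
    len_a, len_b: chain lengths, so the fragments cannot run off the end.
    fit(start_a, start_b, n): returns (drms, dmax) for fragments of length n.
    Returns the final length, i.e. the last length that met both limits.
    """
    while start_a + length < len_a and start_b + length < len_b:
        drms, dmax = fit(start_a, start_b, length + 1)
        if drms > drms_max or dmax > dmax_max:
            break  # the next addition would violate the upper limits
        length += 1
    return length
```

Passing the superposition routine in as a parameter keeps the loop independent of how the fit is computed, which also makes it easy to exercise with a stub.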
Cluster Analysis

In order to assemble pairs of fragments into a pair
of larger units, we use a simple incremental clustering
procedure. In each step, one needs to determine
if a pair of fragments can join a cluster. The most
plausible way would be to simply add the fragment
pair to the cluster and to perform a complete 3-D
superposition on the entire set of equivalenced residues
to see if the positional errors are less than
the specified limits. This is a slow process, so a filter
that determines if a pair of fragments could potentially
join a cluster is needed. The filter we use assesses
whether the fragment pairs A1/B1 and A2/B2 can be
joined into a cluster, i.e., whether A1 + A2, treated as
a rigid body, can be superposed well onto B1 + B2
(Fig. 1). This is calculated during the generation of
the diagonal plot.

First we check that the distance between the centers
of mass of A1 and A2 is similar to the distance
between the centers of mass of B1 and B2. If these
intraprotein distances are too dissimilar, then there
cannot be a good superposition of A1 + A2 onto
B1 + B2. Second, the rotation matrix of the optimal
superposition of A1 onto B1 is compared with that of
A2 onto B2. This is done by multiplying one
superposition rotation matrix (R1) with the inverse of the
other (R2) and analyzing the departure from the unit
matrix in terms of the resulting net rotation angle θ
given by

    \cos\theta = \tfrac{1}{2}\left[\operatorname{trace}\left(R_1 \cdot R_2^{-1}\right) - 1\right].


The discrepancy angle θ is equal to zero if R1 and
R2 represent identical rotations. If θ is above a cutoff
value, typically 0.2-0.3 radians, the two rotations
are considered dissimilar and the fragments cannot
be merged. The rationale behind this is as follows: if
two proteins are perfectly superposable, then every
pair of equivalent substructures is also perfectly
superposable, with the same rotational component of
the superposition transformation; deviations from
perfect superposability can be detected as
deviations in the rotational component. Because the
same reasoning does not hold for the translational
component of the transformation, we use the vector
between the centers of mass, as described above.
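The rotational test can be sketched in plain Python. The helper names below are hypothetical; the only facts used are that the inverse of a rotation matrix is its transpose and that the trace of a rotation by angle θ equals 1 + 2 cos θ. The default cutoff of 0.25 radians is simply a value inside the 0.2-0.3 range mentioned in the text.

```python
import math


def matmul3(a, b):
    """Product of two 3 x 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose of a 3 x 3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def discrepancy_angle(r1, r2):
    """Net rotation angle (radians) between two rotation matrices."""
    net = matmul3(r1, transpose3(r2))
    trace = net[0][0] + net[1][1] + net[2][2]
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, 0.5 * (trace - 1.0)))
    return math.acos(cos_theta)


def rotations_compatible(r1, r2, cutoff=0.25):
    """True if the two superposition rotations differ by at most `cutoff`."""
    return discrepancy_angle(r1, r2) <= cutoff


def rotation_about_z(angle):
    """Rotation matrix about the z-axis, used here only for illustration."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For example, `discrepancy_angle(rotation_about_z(0.1), rotation_about_z(0.3))` evaluates the angle between two rotations about the same axis, which is just the difference of their angles.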
In order to determine the largest cluster(s) for a
given protein pair, each pair of fragments should be
used in turn to start a new clustering process. In
general, this implies N^2 comparisons of fragment
pairs given N pairs of fragments in the diagonal
plot. In practice, it is often possible to terminate
the search as soon as a sufficiently large
cluster is found, say, exceeding the size of a minimal
folding unit (40-50 residues) or, say, containing
a large fraction of all residues in one of the proteins.
If the two proteins have a measurable degree of
similarity, it is likely that the fragment pairs near the
main diagonal will provide the largest cluster. Therefore
in practice we search for clusters along the main
diagonal first and terminate on cluster size, reducing the
complexity of the clustering procedure from order
N^2 to order N.

Final Adjustment and Equivalencing 

Creation of the diagonal plot and clustering of 
pa^s Xment in principle solves U-P""^ 

in the final superposition, such short fragments
could be interesting provided they fit into the overall
context; (2) in the clustering procedure a new
fragment pair was only compared with the starting
member of the cluster, so one is not yet sure that the
distance criteria are fulfilled for the entire cluster.
distance cm ^ addltlona l 

Laments thaTare 6 part of the cluster only if their 

Jte anf fine-tuned with an iterative procedure 
similar in part to that of Rao and Rossmann. First,
the fragments are optimally superposed such that
Drms, the average positional deviation, is minimal.
Subsequently, the list of equivalenced residues is
reexamined and adjusted according to the criteria
discussed below and a new overall transformation
and Drms are determined. The procedure is iterated
until no further adjustment is required. This
termination condition is normally fulfilled within six to
nine cycles. In the equivalencing pass of the final
optimization a pair of residues is accepted (1) if the
residues are within Dmax of each other and (2) if the
pair is part of two consecutive stretches of minimal
length (say 5 residues), acceptable according to (1).
Optionally, fragments are allowed to run sequentially in
opposite directions. The final cluster is reported after
final superposition as a list of equivalenced
residues, i.e., as the structure-derived sequence
alignment.
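One reading of the two acceptance criteria in the equivalencing pass is sketched below: a candidate residue pair is kept only if its distance after superposition is within Dmax and it lies in an unbroken run of at least a minimum number of such pairs. The source text is damaged at this point, so this interpretation, the function name, and the input convention (one distance per consecutive candidate pair) are all assumptions.

```python
def filter_equivalences(distances, dmax=3.8, min_stretch=5):
    """Return indices of candidate residue pairs that pass both criteria.

    distances[k] is the distance (in Å) between the k-th candidate pair of
    residues after superposition; consecutive k are consecutive residues in
    both chains. A pair is kept if its distance is within dmax and it is
    part of a run of at least min_stretch consecutive pairs within dmax.
    """
    kept = []
    run = []
    for k, d in enumerate(distances):
        if d <= dmax:
            run.append(k)
        else:
            if len(run) >= min_stretch:
                kept.extend(run)
            run = []
    if len(run) >= min_stretch:   # flush the final run
        kept.extend(run)
    return kept
```

In an iterative refinement, the surviving pairs would be fed back into the superposition step and the filter repeated until the list stops changing.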

RESULTS 

As a test of the method, several well-known comparisons
were redone: two hemoglobin chains, plastocyanin
and azurin, and the two domains of rhodanese.
In addition, we report discovery of an unexpected
structural similarity: ubiquitin-ferredoxin. For the

alpha and beta chain ^J^* 
alignment agrees with that of Lesk a 

For plastocyanin^ azur.n (I 3), our 
agrees with the alignment by Adman. A known 
Fig. 2. Human deoxyhemoglobin beta chain (dashed lines)
superposed on the alpha chain (solid lines). Stereo view.
N terminus and C terminus labelled.

are indicated by crosses. Stereo view.



example of internal duplication is bovine liver
rhodanese. 1 This molecule is composed of two structurally
similar domains, with no detectable sequence
similarity between them. The cores of these
two domains are almost identical, but the loop regions
vary in length. The fragment match diagonal
plot (Fig. 4), comparing the first with the second
domain, has traces near the main diagonal that can
be merged into one large cluster, corresponding to
the superposition of the two domains (Fig. 5). The
derived alignment is essentially identical to the one
by Ploegman et al., with at most one
residue more or less equivalenced at the ends of
fragments.

A database scan turned up several new structural
similarities. One example is the pair
ubiquitin 28 / ferredoxin 27 (Fig. 6). Ubiquitin, a 76
residue protein, is involved in protein breakdown
via covalent conjugates, whereas ferredoxin, with 98
residues, functions as an electron carrier in the
photoreduction of cytochrome c. Surprisingly, the
three-dimensional structures are remarkably similar. The
overall rms deviation of 47 out of maximally 76
equivalenced alpha carbon atoms is 2.1 Å. Both
structures can be described as a hand of five beta
strands holding a short beta strand and an alpha
helix in the center. There is no obvious analogy of
protein function and, apparently, the structural
similarity had gone undetected. Perhaps ubiquitin and
ferredoxin do have a common ancestor. Alternatively,
the ferredoxin and ubiquitin "beta-grasp" domain
may be an energetically favored folding unit.

Fig. 4. Diagonal plot of fragment similarities in rhodanese,
between the first and second domains. Each trace corresponds
to a fragment pair, which may or may not fit with
the overall domain comparison.

Fig. 6. A. Ferredoxin (dashed, 1UBQ) superposed on ubiquitin
(solid lines, 3FXC). Stereo view. Two loops in ferredoxin for which
no equivalent loops are present in ubiquitin are removed (top left),
and replaced by thin dashed lines, for clarity. The superposition is
generated by superposing the following fragments, equivalenced
by the algorithm (1ubq range / 3fxc range): M1-L8/Y3-E10;
T21-A28/N15-E31; G47-S57/G74-T84; E64-V70/D86-H92. B.
Topology scheme for ferredoxin and ubiquitin. Circle: alpha helix;
squares: extended strands. The alpha helix lies across the open
hand formed by the beta strands 1, 2, 3, 4, and 6. Strands 2, 1, 6, 3
and 4 form a sheet in which the irregular strand 5 does not
participate. The two domains have exactly the same topology.



CONCLUSION 

Our algorithm provides a novel tool for the com- 
parison of protein structures with the options of al- 
lowing for altered loop topology and for reversal of 
chain direction. The entire procedure is fully auto- 
matic and can be used in a routine manner. The 
method is so fast that the comparison of one single 
structure with all known structures is possible with 
only a few hours CPU usage on a workstation. Large 
insertions or deletions or many insertions or dele- 
tions are no problem. The method can be used in any 
context where structural alignment is useful, e.g., to 
determine reliable (structure based) sequence alignments,
to aid in the definition of structural cores of
protein families, and to find common three-dimen- 
sional folding units. 

The method is implemented as an option in the 
molecular modeling and drug design program 
WHAT IF, 28 facilitating immediate visualization by 
computer graphics. WHAT IF is written in FOR- 
TRAN 77, with graphics drivers for Evans and 
Sutherland and Silicon Graphics computers. The 
program is available from G.V. for a minimal fee. 
Send electronic mail to VRIEND@EMBL-Heidelberg.DE
on internet for information.
Abstract

The structure and dynamics of the N-terminal activation domains of the yeast heat shock transcription factors
of Kluyveromyces lactis and Saccharomyces cerevisiae were probed by heteronuclear 15N(1H) correlation and
15N{1H} NOE NMR studies. Using the DNA-binding domain as a structural reference, we show that the protein
backbone of the N-terminal activation domain undergoes rapid, large-amplitude motions and is therefore
unstructured. Difference CD data also show that the N-terminal activation domain remains random-coil, even in the
presence of DNA. Implications for a "polypeptide lasso" model of transcriptional activation are discussed.
Keywords: activation domains; dynamics; heat shock factor; NMR; unstructured 



Transcription factors orchestrate the regulated production of 
key proteins in eukaryotic cells during development and in re- 
sponse to extracellular stimuli. Typical transcription factors con- 
sist of a DNA-binding domain, an oligomerization domain, and 
one or more activation domains (Tjian & Maniatis, 1994). The 
structural basis for the function of the DNA-binding and
oligomerization domains has been well characterized (Nelson, 1995);
however, similar information for the activation domains remains 
scarce (Triezenberg, 1995). Many transcriptional activation do- 
mains fall into one of three categories on the basis of their pre- 
dominant amino acid compositions: acidic, proline-rich, or 
glutamine-rich (Mitchell & Tjian, 1989). Fusion proteins pro- 
duced by joining these domains with heterologous DNA-binding 
domains have exhibited transcriptional activation (Tjian & 
Maniatis, 1994). Several recent studies conducted on the "acidic" 
activation domains of Vmw65 protein of herpes simplex virus 
(Donaldson & Capone, 1992; O'Hare & Williams, 1992), NF-κB 
p65 (Schmitz et al., 1994), and the τ1 core of the human 
glucocorticoid receptor (Dahlman-Wright et al., 1995) concluded that 
these domains lacked well-defined structure. These results were 
based essentially on negative information: a lack of long-range 
NOEs and poor proton chemical shift dispersion in 1H NMR 
spectra. 

In this report, we use information on the dynamics of the protein 
backbone obtained from 15N{1H} heteronuclear NOE, 
along with CD experiments, to characterize the structures of the 
N-terminal transcriptional activation domains of the heat shock 
transcription factor (HSF) from two species of yeast. HSF is an 
inducible transcriptional activator that regulates the expression 
of the heat shock proteins when eukaryotic cells are exposed to 
elevated temperatures or other environmental insults (Lis & Wu, 
1993; Morimoto, 1993). HSF contains a DNA-binding domain, 
a trimerization domain, and one or two transcriptional activa- 
tion domains. In yeast, HSF is constitutively bound at heat 
shock elements containing the sequence 5'-nGAAn-3' (Amin 
et al., 1988; Xiao & Lis, 1988) and acts as a transcription fac- 
tor even under nonstressed conditions (Jakobsen & Pelham, 
1988; Sorger & Pelham, 1988; Wiederrecht et al., 1988; Gross 
et al., 1990; Jakobsen & Pelham, 1991; Chen et al., 1993). 
Functional HSF is required for yeast viability at all temperatures 
(Sorger & Pelham, 1988; Wiederrecht et al., 1988; Gallo et al., 
1993). Heat shock or stress conditions effect a higher level of 
transcriptional activation, which is mediated through 
mechanisms that are currently being investigated (Nieto-Sotelo et al., 
1990; Sorger, 1990; Gallo et al., 1993). 

Biochemical and genetic experiments have been used to map 
the transcriptional activity of yeast HSF to distinct regions of 
the N- and C-termini (Nieto-Sotelo et al., 1990; Sorger, 1990; 
Jakobsen & Pelham, 1991) (Fig. 1). These activation domains 
could not be grouped into any known classes by their amino acid 
content, nor could they be categorized into any known folding 
motifs by structure prediction programs. The C-terminal 
activation domains from Kluyveromyces lactis and Saccharomyces 
cerevisiae exhibit high levels of constitutive activity when fused to 
a heterologous DNA-binding domain (Nieto-Sotelo et al., 1990; 
Sorger, 1990; Jakobsen & Pelham, 1991). The intact C-terminal 
activation domains are too large (300+ residues) for practical 
NMR studies. Although the residues most involved in activation 
in the C-terminal domain have been mapped to a much smaller 
segment (amino acids 592-623), its activity is clearly modulated 
by other regions of the protein (Chen et al., 1993), making this 
segment a questionable target for structural studies. 
Deletions of the entire C-terminal region result in a functional 
HSF at physiological temperatures (Nieto-Sotelo et al., 1990; 
Sorger, 1990); this implies that the N-terminal activation domain 
is sufficient for the required constitutive level of transcriptional 
activity. In fact, the N-terminal activation domains from both 
yeast strains also show a low level of transcriptional activity 
when fused to a heterologous DNA-binding domain (Nieto- 
Sotelo et al., 1990; Sorger, 1990). The smaller size of the 
N-terminal activation domain (170-190 residues) makes it 
suitable for study by NMR. In addition, the N-terminal region is 
flanked by the well-studied and stable DNA-binding domain 
(Damberger et al., 1994; Harrison et al., 1994). 

Using two constructs that contained both the N-terminal 
activation and the DNA-binding domains from K. lactis and 
S. cerevisiae, we were able to compare the structure and dynamics of 
the N-terminal activation domain against an internal structural 
control, the DNA-binding domain. 15N{1H} heteronuclear 
single quantum coherence (HSQC) spectra showed that the 
chemical shift dispersion of the 1H resonances of the activation 
domains was poor, typical of that observed in denatured 
proteins (Neri et al., 1992; Shortle & Abeygunawardana, 1993). In 
addition, we used two-dimensional heteronuclear 15N{1H} 
NOE NMR and measurements of 15N relaxation parameters to 
show that the N-terminal activation domains from both yeast 
strains have a high degree of flexibility, which is consistent with 
an unstructured state in solution. The results are particularly 
compelling because they offer the first positive evidence for a 
dynamically disordered transcriptional activation domain. 



Results 



Dynamics probed by heteronuclear relaxation 

Heteronuclear relaxation studies provide a residue-specific probe 
of the conformational dynamics of proteins (Kay et al., 1989; 
Palmer, 1993). Current NMR approaches couple dynamic 
measurements with structural information to gain insight into 
protein function. Typical heteronuclear relaxation studies result in 
the determination of order parameters for backbone amide 
N-Hs (Clore et al., 1990; Barbato et al., 1992; Redfield et al., 
1992; Cheng et al., 1993; Farrow et al., 1994; Zink et al., 1994). 
Order parameters, which define the amplitude and range of 
local molecular motions, are obtained using data acquired from 
a suite of three heteronuclear relaxation NMR experiments: 
15N T1 longitudinal relaxation, 15N T2 transverse relaxation, 
and heteronuclear 15N{1H} NOE. In a recent example, Bax and 
coworkers applied such studies to support the "flexible tether" 
model for calmodulin function (Barbato et al., 1992). 

The heteronuclear 15N{1H} NOE provides a qualitative 
assessment of the mobility of N-H bond vectors for individual 
residues. It is sensitive to both the overall tumbling time of the 
protein (τm) and fast internal motions. Maximal NOE values 
(+0.83) occur in the slow-tumbling limit (ωNτm ≫ 1), 
indicating that the N-H bond vectors reorient with the overall tumbling of 
the molecule. Minimal NOE values (-3.5; assuming isotropic 
rotation, 15N resonance frequency of 60.8 MHz, and 1.02 Å 
N-H bond length) occur in the fast-tumbling limit (ωNτm ≪ 1) 
and are indicative of rapid, large-amplitude motions with respect 
to the overall tumbling of the molecule (Palmer, 1993). 

Although heteronuclear NOE data alone are not sufficient for 
fully quantifying the dynamic behavior of molecules, they can 
be used as a probe for assessing the structural state of a protein. 
In one recent example, heteronuclear NOE data were particu- 
larly enlightening when comparing the backbone flexibilities be- 
tween the folded and unfolded states of the drkN SH3 domain 
(Farrow et al., 1995). The data showed dramatic differences in 
dynamic behavior between the folded and unfolded states: the 
structured, folded state exhibited positive NOEs, and the un- 
folded state showed primarily negative NOEs. 



Fig. 1. Schematic representations of K. lactis and S. cerevisiae HSF. 



Dynamic variations within the K. lactis 
HSF DNA-binding domain 

Most of the cross peaks in the 15N{1H} HSQC spectra of the 
K. lactis HSF DNA-binding domain are well resolved. The spec- 
trum for the heteronuclear NOE version of this experiment is 
shown in Figure 2A. Assignments were based on the previous 
NMR study (Damberger et al., 1994). Calculated NOE values, 
shown in Figure 3A, have an average of 0.67 (0.81 average in 
regions of secondary structure). The heteronuclear NOEs for the 
structured regions of the DNA-binding domain are positive and 
near the slow tumbling limit, as opposed to the C-terminal 
residues His-91 and Ala-92, which show negative NOEs indicative 
of rapid, large-amplitude motions. Areas of intermediate NOEs 
reflect an increase in dynamic flexibility. The best example of 
this can be seen for the L1 loop (residues 69-83). Previous 
structural work had suggested that there was higher mobility in this 
loop region, based on the lack of long-range contacts in the 
NMR structure (Damberger et al., 1994) and the lack of elec- 
tron density for residues 76-79 in the crystal structure (Harri- 
son et al., 1994). The heteronuclear NOE gives direct evidence 
that this region has a higher degree of dynamic flexibility with 
respect to the overall tumbling of the protein. 

Structural assessment of the S. cerevisiae 
HSF DNA-binding domain 

There are no published structures of the DNA-binding domain 
of S. cerevisiae; however, its high sequence homology (73% 
Fig. 2. Heteronuclear 15N{1H} NOE spectra for K. lactis HSF constructs. Positive intensity peaks are denoted in black; 
negative intensity peaks are denoted in red. A: DNA-binding domain construct. Cross peaks are labeled with their amino acid 
assignments (Damberger et al., 1994). B: Construct of the N-terminal activation domain plus the DNA-binding domain. 



identity) (Devereux, 1991) and identical DNA-binding specific- 
ity to the DNA-binding domain of K. lactis HSF (Jakobsen & 
Pelham, 1991) strongly suggest that it has the same fold (Hubl 
et al., 1994). Indications that the DNA-binding domain of 
S. cerevisiae HSF is structured could be seen from the well-dispersed 
nature of the 15N{1H} HSQC spectra of this domain (data not 
shown) and from the similarity of the secondary structure to that 
of the K. lactis DNA-binding domain, as evidenced by far-UV 
CD experiments (data not shown). The presence of structure in 
this domain was verified with heteronuclear NOE experiments. 
Most of the calculated NOEs are positive and fall within the slow 
tumbling limit (data not shown). The average NOE value was 
0.63 for the resolved peaks, just as in the K. lactis equivalent. 
Because of the well-behaved nature of this domain, we were able 
to use it as an internal structural control for analysis of the 
N-terminal activation domain of S. cerevisiae HSF. 



N-terminal activation domain of 
yeast HSF is unstructured 

The heteronuclear NOE spectrum of the K. lactis HSF 
N-terminal activation domain plus DNA-binding domain is 
shown in Figure 2B. The similarity of the peaks in Figure 2A 
with the positive peaks in Figure 2B facilitated the transfer of 
peak assignments. Of the 85 N-H signals previously assigned 
(Damberger et al., 1994), we were able to clearly identify 75 in 
our spectra. The respective NOE values for these peaks (Fig. 3B) 
correspond very well with the data for the DNA-binding domain 
alone (Fig. 3A). It appears that the backbone dynamics in most 
of the DNA-binding domain are not affected by the presence 
of the N-terminal activation domain. 

The remaining peaks in Figure 2B, all negative in intensity, 
can be attributed to the N-terminal activation domain. These 
additional peaks have poor shift dispersion in the proton 
dimension, as is the case for unstructured proteins. Integrating the 
volumes of the signals in the spectra could account for 
approximately 183 of the possible 192 residues; however, only 103 peaks 
were sufficiently resolved to calculate NOEs. These NOE val- 
ues (Fig. 4) are sorted by proton chemical shift because there 
are no sequence-specific assignments currently available for this 
domain. Negative values indicate that all of the residues in this 
region exhibit a high degree of rapid internal motion, typical for 
an unstructured protein. Rapid motions of the activation domain 
also affect the dynamics of the N-terminus of the DNA-binding 



domain. Arg 2 in the DNA-binding domain alone showed a 
small positive NOE (Fig. 3A), but the presence of the highly dy- 
namic activation domain in the fusion construct of the N-terminal 
activation domain plus the DNA-binding domain seems to drive 
the corresponding residue (Arg 194) into a rapid motion regime 
(Fig. 3B). 

Identical experiments were performed on the comparable con- 
struct for S. cerevisiae HSF. Despite the lack of residue-specific 
assignments for the DNA-binding domain, we were still able to 
identify the corresponding peaks in the full-length construct 
by comparison with spectra of the DNA-binding domain alone. 
Fig. 4. NOE data for the activation domain peaks of the construct of the N-terminal activation domain plus the DNA-binding 
domain. Histogram sorted by proton chemical shift. 
Again, the calculated NOEs fell predominantly in the slow 
tumbling regime. The remaining signals, attributable to the N-terminal 
activation domain, all had negative NOEs indicative of rapid, 
large-amplitude motions. This shows that the N-terminal 
activation domain of S. cerevisiae also behaves as an unstructured 
protein. 

To confirm that the N-terminal activation domains were not 
unstructured because of denaturation at low pH, further NMR 
spectroscopic studies were performed at pH 5.75 (data not 
shown). Comparison of the chemical shift distribution in spec- 
tra acquired at higher pH with those at lower pH indicated no 
change in the structural state of the constructs. The DNA- 
binding domains remained folded, whereas the activation do- 
mains persisted in an unstructured state. 

Far-UV CD spectroscopy, performed at pH 7.0, also indicates 
that the N-terminal activation domains of yeast HSF are un- 
structured. Figure 5A shows CD spectra for the DNA-binding 
domain, and for the N-terminal activation domain plus the 
DNA-binding domain, as well as the difference between the two. 
The difference spectrum is characteristic of a random coil. We 
also examined the possibility that the N-terminal activation do- 
main might behave differently when the DNA-binding domain 
is bound to a DNA-binding site (Fig. 5B). An examination of 
this CD difference spectrum indicates that binding of the DNA- 
binding domain to the DNA does not induce any structural 
changes in the N-terminal activation domain. Identical results 
were obtained for the K. lactis constructs (not shown). 

Discussion 

Progress is being made in the effort to understand the mecha- 
nism of transcriptional activation. Structural studies of proteins 
often reveal important aspects of their function and are a power- 
ful complement to biochemical and genetic studies. Heteronu- 
clear relaxation NMR spectroscopy can be used to characterize 
the behavior of unstructured or partially structured protein 
states in solution. Using heteronuclear NOE experiments, we 
have been able to show that the N-terminal activation domain 
of yeast HSF undergoes rapid local motions and is unstructured 
in solution. It is possible that the structural state of the N-terminal 
activation domain might be affected by the presence of the de- 
leted C-terminal regions. However, this seems unlikely because 
the N-terminal activation domain is functional when fused to 
heterologous DNA-binding domains (Nieto-Sotelo et al., 1990), 
and C-terminal deletion constructs are sufficient for cell viabil- 
ity (Nieto-Sotelo et al., 1990; Sorger, 1990). 

There have been several previous studies of functional frag- 
ments of acidic activation domains, varying in both size and 
source. The consensus of these studies is that these peptides 
alone in solution have little secondary structure, although data 
suggest that either α helices (Donaldson & Capone, 1992; 
Schmitz et al., 1994; Dahlman-Wright et al., 1995) or β sheets 
(Leuther et al., 1993; Vanhoy et al., 1993) can be induced un- 
der specific solvent conditions. It has then been concluded that 
these conformations might be important in the function of the 
domains, although there is no direct evidence for this. Models 
have also been presented that suggest that activation domains 
interact through amphipathic helices (Giniger & Ptashne, 1987), 
but these models have limited support from either genetic or bio- 
chemical data. It has been clearly shown, however, that specific 
patterns of acidic and hydrophobic residues are required for ac- 
Fig. 5. Far-UV CD spectra of S. cerevisiae HSF constructs. A: Protein 
only. B: Protein and DNA. DBD, DNA-binding domain construct; 
AD + DBD, construct of the N-terminal activation domain plus the 
DNA-binding domain; AD (diff.), activation domain difference spectra. 



tivity of acidic activation domains (Cress & Triezenberg, 1991; 
Regier et al., 1993), although this does not seem to hold for glu- 
tamine- or proline-rich domains (Gerber et al., 1994). The stud- 
ies presented here are the first on a general sequence activation 
domain (not falling into acidic, proline-rich, or glutamine-rich 
classes), and also the first on a complete activation domain. The 
NMR data clearly show that this domain is highly disordered, 
with internal motions on a very rapid time scale. We have 
reached somewhat similar conclusions about an activation do- 
main fragment from c-jun (unpubl. data), which again does not 
fall into any of the classes noted above. These data, taken to- 
gether with previous studies of acidic activation domains, argue 
strongly that the disorder observed is in fact a general property 
of activation domains of all classes. 

It is not obvious how an unstructured activation domain 
might interact with the components of the general transcriptional 
machinery. An "induced fit" model proposes that some proteins 
may function by acquiring structure in the presence of target 
factors (Koshland, 1958). Induced fit structural changes have been 
found in systems such as b-ZIP binding to DNA (Talanian 
et al., 1990), calmodulin binding to target proteins (Ikura et al., 
1992), and staphylococcal nuclease binding to inhibitor (Hynes 
& Fox, 1991). However, these examples are of proteins that 
already possess some degree of structure in their "apo" forms. The 
structural changes induced in these proteins by their interacting 
partners are relatively small and are localized around the sub- 
strates. No conclusions can be drawn about the structure of ac- 
tivation domains in their activated complexes until further data 
is gathered, but it would be surprising to find that systems as 
large and dynamically disordered as activation domains could 
become completely folded via an induced fit mechanism. 

With respect to this, we suggest that activation domains only 
serve as "polypeptide lassos" that rein in and increase the local 
concentrations of the factors needed for activated transcription. 
Evidence for this type of role can be drawn from reports indi- 
cating that activation domains interact with several different 
factors of the general transcriptional machinery (Tjian & Mani- 
atis, 1994). The localization of these factors could stimulate the 
formation of the transcriptional complex and thereby initiate 
transcription. The exact nature of this recruiting process is still 
not clear, but some insight can be gained from examining the 
biochemical data on the interaction of acidic domains with the 
general transcription factor TFIIB. Recently published results 
show that acidic activators interact with the C-terminal region 
of TFIIB to recruit this factor into the preinitiation complex 
(Roberts & Green, 1994). However, the protein-protein 
interactions that mediate this and similar events are unlikely to be 
highly specific in nature. Experiments using heterologous tran- 
scription factors formed by a fusion of GAL4 or GCN4 DNA- 
binding domains to random polypeptides coded by fragments 
of genomic Escherichia coli DNA found the polypeptides that 
provided the highest levels of enhanced transcription to be rich 
in acidic side chains but to possess no other obvious sequence 
similarity (Ma & Ptashne, 1987). This reinforces the notion that 
nonspecific interactions can play a large role in the process of 
activated transcription, a generalization of the "acid blobs and 
negative noodles" idea of Sigler (1988). It would then not be too 
surprising if other activation domains recruited their factors in 
a similar, nonspecific fashion. Once needed factors are gathered, 
their close proximity to each other could stimulate the assem- 
bly of the active transcription complex. Similar proposals have 
been outlined by Frankel and Kim (1991). 

Thus, it may not be necessary for activation domains to form 
any well-defined and unique structure. The idea of activation 
domains functioning as "polypeptide lassos" better supports the 
impression that a small number of transcription factors are able 
to regulate the production of thousands of proteins (Tjian & 
Maniatis, 1994). Their lack of a well-defined and unique struc- 
ture could make activation domains more versatile in their abil- 
ities to recruit the various factors necessary for formation of the 
transcriptional complex. 

Materials and methods 

Cloning 

The HSF activation domain plus DNA-binding domain con- 
structs were cloned from plasmids developed previously (Flick 



et al., 1994; Harrison et al., 1994). The plasmid pHN200, which 
contains the coding sequence for the K. lactis HSF, was 
modified by site-directed mutagenesis with two oligomers to insert 
an NdeI site at the N-terminus and an SphI site at amino acid 
284. This restriction fragment was subcloned into NdeI/SphI 
digested pHN104, a T7-driven expression vector. The 
S. cerevisiae construct was made in a similar manner from plasmid 
pHF153, which codes for the first 259 amino acids of HSF. It 
already contained the correct SphI site at residue 259. An NdeI 
site was introduced at the N-terminus with site-directed 
mutagenesis, and again the NdeI/SphI fragment was subcloned into 
pHN104 for expression. 

Sample preparation 

Plasmid constructs were transformed into E. coli (strain BL21 
(DE3)/pACYC) (Flick et al., 1994; Harrison et al., 1994). 
Uniform 15N-labeled protein was obtained by growing the 
bacteria on M9 minimal medium with [15N]NH4Cl (CIL) as the sole 
nitrogen source. Induction occurred at OD594 = ~0.6 with 1.2 g 
IPTG/L, and the cells were harvested 4 h later by 
centrifugation at 4,000 rpm, 4°C, in a Sorvall GS3 rotor. Protein 
purification was accomplished by a simplified version of a previously 
reported method (Harrison, 1994). Cells were resuspended in 10 
mL/g cells isotonic wash buffer, spun down, resuspended in 1 M 
NaCl lysis buffer, and lysed by sonication for 6 min in a dry 
ice/ethanol bath. A high-speed spin was used to remove 
cellular debris, and the resultant supernatant was diluted to 100 mM 
NaCl and loaded onto a pre-equilibrated heparin column. Prior 
to elution, the column was washed with a low-salt buffer. The 
desired protein was eluted with 500 mM NaCl. PAGE (15%) was 
used to assess the purity of fractions collected. 
Electrospray-ionization mass spectroscopy confirmed >98% 15N 
incorporation. Fractions containing >95% of the desired protein were 
dialyzed into NMR buffer to yield final 0.5-mL samples of 
~3-5 mM protein in 10 mM KH2PO4, 90% H2O/10% D2O, at 
pH 3.4 or pH 5.75. 

NMR spectroscopy 

15N{1H} heteronuclear NOE spectra were measured on a 
Bruker AMX-600 instrument as described (Barbato et al., 1992). 
Spectra were initially acquired at pH 3.4 to minimize proton ex- 
change and maximize the signal-to-noise ratio. The 8-k data 
points were collected in the t2 dimension to increase resolution. 
Data were processed with the NMRPipe suite of programs 
(Delaglio, 1993), and peaks picked with the CAPP/PIPP suite 
of analysis programs (Garrett et al., 1991). Spectra at pH 5.75 
were collected with the PEP-Z-HSQC experiment (Akke et al., 
1994). Data were processed with the Felix processing package 
(Biosym Technologies). 

CD spectroscopy 

Far-UV CD spectroscopy was performed on an AVIV Model 
62DS equipped with a temperature-controlled cell holder and 
connected to an IBM-compatible workstation for data analysis. 
Samples were prepared as 1 mg/mL solutions in 25 mM sodium 
phosphate buffer, pH 7.0, at 25 °C. A 2:1 concentration ratio of 
DNA (5'-CCGGTGAATTTTCTTGAATGGCC-3') to protein was 
used for the protein/DNA experiments. Difference spectra were 
obtained using Microsoft Excel. A three-point moving average was 
used for data smoothing. 
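
The difference-spectrum and smoothing steps described above can be sketched in a few lines. This is an illustration only, not the authors' procedure: the original work used Microsoft Excel, and the function names, the assumption that both spectra share the same wavelength grid, and the handling of the two end points (averaged over the two available values) are all assumptions made here.

```python
def difference_spectrum(ad_dbd, dbd):
    """Point-by-point difference (AD + DBD) minus (DBD).

    Assumes both spectra were sampled on the same wavelength grid
    (an assumption of this sketch, not stated in the text).
    """
    if len(ad_dbd) != len(dbd):
        raise ValueError("spectra must have the same number of points")
    return [a - b for a, b in zip(ad_dbd, dbd)]


def moving_average3(values):
    """Three-point moving average.

    End points are averaged over the two values available there;
    the original text does not say how the ends were treated.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    # Hypothetical ellipticity values, for illustration only.
    combined = [5.0, 3.0, 1.0, -2.0]
    dbd_only = [2.0, 1.0, 0.5, -1.0]
    print(moving_average3(difference_spectrum(combined, dbd_only)))
```

Smoothing is applied after subtraction in this sketch; because both operations are linear, smoothing each spectrum first and then subtracting would give the same result.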
2 Rule-Based Approach 



Studies on calcineurin (CaN) [9] sparked the present work; we noticed that the long disordered region (LDR) 
in this protein has a low content of aromatic amino acids (Trp, Tyr, Phe). Several other disordered regions 
were found to have this same characteristic. This makes structural sense because the side chains of aromatic 
amino acids have strong and specific interactions [3] and so would be expected to induce structure and inhibit 
disorder. Using CaN as the prototype, the average fraction of aromatic amino acids was calculated over a 
window of 31 amino acids surrounding each sequence position. The aromatic content dropped significantly 
in both of the longer unobserved regions in CaN (Fig 1). 

The following prediction rule was developed from these observations: (a) for a given protein, the average 
content of aromatic residues is calculated throughout the amino acid sequence, as explained above; (b) if 
there is a contiguous region of more than 80 sequence positions with an average content of aromatic residues 
below 6.5%, the protein is predicted to have an LDR. This predictor was intended for LDRs like the one in 
CaN; because of the large window sizes for this predictor, it is not suitable for predicting M- or SDRs. 
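
As an illustration, the rule above can be written as a short routine. This is a minimal sketch under stated assumptions: the function names are hypothetical, the aromatic set is taken as Trp, Tyr, and Phe as in the text, and windows are simply truncated at the sequence ends, a detail the text does not specify.

```python
AROMATIC = set("WYF")  # Trp, Tyr, Phe (one-letter codes)


def aromatic_fraction(seq, window=31):
    """Average aromatic content in a window centered on each position.

    Windows are truncated at the sequence ends (an assumption of this
    sketch; the text does not describe end handling).
    """
    half = window // 2
    fractions = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half): i + half + 1]
        fractions.append(sum(res in AROMATIC for res in segment) / len(segment))
    return fractions


def predicts_ldr(seq, window=31, threshold=0.065, min_run=80):
    """True if more than `min_run` contiguous positions fall below `threshold`."""
    run = 0
    for frac in aromatic_fraction(seq, window):
        run = run + 1 if frac < threshold else 0
        if run > min_run:
            return True
    return False
```

With these defaults, a sequence needs at least 81 consecutive positions whose windowed aromatic content is below 6.5% before an LDR is predicted.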

3 Neural Network Approach 

The rule-based predictor discussed in the previous section was developed based on information from a 
single protein, CaN, which served as the prototype for our studies. An alternative is to design a feedforward 
neural network predictor trained using the backpropagation learning algorithm [15]. This predictor requires 
construction of a larger set of examples of disordered regions (DRs) and determination of appropriate features, 
as discussed in this section. 

3.1 Disordered Regions Labeled Data Sets 

A search for proteins with invisible regions, which are presumed to be locally disordered, was performed on 
the Protein Data Bank (PDB) at the Brookhaven National Laboratory. This is a public domain archive of 
more than 4,600 experimentally determined three-dimensional structures of proteins. 

Searching PDB for proteins with DRs is a non-trivial task since no standard format for reporting such 
findings is imposed. In addition, several other problems like complexes and repeated sequences further 
complicated this search. In this study, no effort was made to identify all proteins with DRs in the PDB. The 
main objective was to find a sufficiently large set of proteins with confirmed DRs as needed for the design 
of a neural network predictor. 

The PDB search supplied a set of proteins each having at least one DR longer than seven residues. 
These proteins from PDB were supplemented by two additional proteins with DRs (CaN [9] and Bel [12]). 
A histogram of the lengths of the DRs suggested a partition into short, medium and long labeled data 
sets, denoted as SDR (7-21 amino acids), MDR (22-44 amino acids), and LDR (45 or more amino acids), 
respectively. 
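
The length partition can be stated compactly as follows. The function name is hypothetical, and the lower bound of 7 residues follows the SDR range given above.

```python
def length_class(n_residues):
    """Assign a disordered region to a length class by its residue count."""
    if n_residues < 7:
        return None   # shorter than the regions collected here
    if n_residues <= 21:
        return "SDR"  # short: 7-21 amino acids
    if n_residues <= 44:
        return "MDR"  # medium: 22-44 amino acids
    return "LDR"      # long: 45 or more amino acids
```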



3.2 Feature Selection 

The LDR labeled data set was analyzed to identify a pool of properties that discriminate between structured 
and disordered regions. The exploratory analysis considered several attributes measured by averaging over 
windows of consecutive amino acids. Considered attributes included individual amino-acid compositions, 
flexibility [14], hydropathy [11] and hydrophobic moments [5]. 

In addition to the lack of aromatics (in this case just Tyr and Trp) mentioned above, low amounts of 
Cys and His and high amounts of Glu, Asp, Ser, and Lys were also found to be associated with disorder. 
Cys can make special covalent bonds, so its absence in disordered regions is reasonable. Glu, Asp, and 
Lys are charged; charge imbalance would be expected to contribute to disorder. Ser increases solubility 
and provides a flexible locus. Finally, disordered regions would be expected to be soluble and flexible in a 
manner consistent with the findings on hydropathy and flexibility. Thus, overall, the identified attributes 
seem reasonable for promoting disorder. 

A set of m attributes was identified through this analysis as being more discriminative. For each identified 
attribute, n different features were generated by computing this attribute for n different window sizes, yielding 
an m x n matrix of features, where each row corresponds to a different attribute and each column to a different 
window size. 

A formal procedure was used to select the most appropriate feature from each row of this matrix. The 
method used here is an adaptation of the sequential forward search feature selection technique with the 
minimal error probability selection criterion [7]. A quadratic Gaussian classifier using different covariance 
matrices for each class was used to calculate minimal error probability during the search. 

The standard sequential forward search selection technique is a greedy algorithm that begins with an 
empty feature set and adds features to it one at a time. The first feature added is the one deemed to be 
the best according to the selection criteria. The next feature added is the one which results in the largest 
improvement when considered in conjunction with the first feature. Similarly, the i-th feature added is the 
one that results in the largest improvement when considered in conjunction with the previous i - 1 features. 

In the method used here, when the i-th feature is added to the selected features set, its corresponding 
row is removed from the matrix and the search continues on the reduced (m - i) x n matrix. This prevents 
the same attribute from being selected with more than one window size; the resulting selected feature set 
contains the most appropriate window size for each attribute. A set of examples for neural network training 
was constructed from the LDR labeled data set using the m selected features. For convenience, the same 
feature set was used when training on examples from the SDR and MDR labeled data sets. 
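
The modified forward search can be sketched as below. This is a minimal illustration, not the implementation used in the study: the function name is hypothetical, and the selection criterion is passed in as a callable `error` (lower is better) rather than computed with the quadratic Gaussian classifier described above. Features are identified by (row, column) pairs, that is, (attribute, window size) indices.

```python
def forward_select_one_per_row(m, n, error):
    """Greedy forward search over an m x n feature matrix.

    `error` takes a list of (row, col) features and returns an estimated
    error probability. After a feature is chosen, its whole row is removed,
    so each attribute ends up with exactly one window size.
    """
    remaining_rows = set(range(m))
    selected = []
    while remaining_rows:
        candidates = [(r, c) for r in sorted(remaining_rows) for c in range(n)]
        # Pick the candidate that gives the lowest error together with
        # the features already selected.
        best = min(candidates, key=lambda f: error(selected + [f]))
        selected.append(best)
        remaining_rows.remove(best[0])
    return selected


if __name__ == "__main__":
    # Toy criterion for illustration: each feature has a fixed cost.
    cost = [[0.3, 0.1, 0.2],
            [0.05, 0.4, 0.5]]
    toy_error = lambda feats: sum(cost[r][c] for r, c in feats)
    print(forward_select_one_per_row(2, 3, toy_error))
```

Because a row is dropped after each selection, the search ends after exactly m steps, whereas the standard technique needs a separate stopping rule.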

4 Results 

The SDR labeled data set contained 38 disordered segments from 34 proteins with 411 disordered amino 
acids and 11,050 total amino acids; MDR set contained 22 disordered segments from 20 proteins with 464 
disordered amino acids and 4,764 total amino acids; and finally, LDR set contained 7 regions from 7 proteins 
with 465 amino acids and 2,069 total amino acids. The result of the exploratory analysis on 24 considered 
attributes was the selection of 10, shown in Table 1. Shown here are the most appropriate windows for 
each of these attributes obtained through the feature selection process discussed in Section 3.2 by exploring 
odd-numbered values ranging from 9 to 21. 

4.1 Prediction Accuracy Estimates 

The rule-based predictor was designed using the CaN knowledge. When tested on a residue-by-residue 
basis on the remaining 6 LDR proteins, it achieved a 70% success rate. This result is surprisingly good 
given the simplicity of the rule and suggests that lack of aromatic amino acids is a strong determinant for the 
development of LDRs. 

Balanced sets of the 10-dimensional feature values corresponding to the unobserved and observed amino 
acids were constructed from the S-, M-, and LDR labeled data sets. These feature sets were each randomly 
partitioned into 5 disjoint balanced subsets from observed and unobserved amino acids. A neural network 
architecture was determined through limited experimentation, and a machine with 10 inputs, one hidden 
layer with 6 units, and a single output unit was used for in-depth testing. 5-cross validation experiments 
starting from 3 different random initializations of neural network parameters were performed, resulting in a 
total of 15 runs each for the S-, M-, and LDR labeled data sets (Table 2, rows a-e). 

Averaging the results from the 5-cross validation experiments gave residue-by-residue prediction accura- 
cies ranging from 66 to 74%. Lumping all length-classes together (ADR) and repeating the 5-cross validation 
experiment led to a drop in prediction accuracy, to about 60%. Finally, when testing each predictor on data 
sets of other length classes, the prediction accuracies dropped to 59-67%. The greatest drop was observed 

4.2 Prediction of Disordered Regions in Proteins 

Table 3: Fraction of proteins predicted to have DR longer than specific values. 



these observations suggest that DRs, even LDRs, are very common in nature. 



5 Conclusions 

The LDR, MDR, and SDR predictors were significantly more accurate when applied to the same length 
class. Even the process of grouping all disordered amino acids together reduced the prediction accuracy 
substantially (Table 2). These results strongly suggest that amino acid sequence characteristics leading to 
disorder are dependent on the length of the disordered region. 

The current views of protein structure and function still seem to be dominated by the concepts of rigid 
organization and lock-and-key interactions [6], despite many examples of disorder-to-order transitions upon 
binding. As we point out, disorder-to-order transitions upon binding have been found for a diversity of 
molecular interactions that span the biological domain [4]. 

Koshland's induced-fit hypothesis introduced flexibility as an alternative to the lock and key [10]. Without 
reference to induced fit, Schulz [13] pointed out that the increases in free energy when flexible components 
solidify upon binding enable high specificity without excessively high affinities. Petsko and his collabora- 
tors fll independently showed that loss of flexibility could help prevent excessively tight binding, but failed 
to note the coequally important feature of trading flexibility for specificity. We recently extended Schulz's 
proposal to show that disorder-to-order transitions allow natural selection to operate separately on affinity 
and specificity. We propose that such a separation is essential for the evolution of complex signaling and 
metabolic networks [4]. 

If disordered regions are required for the separation of affinity and specificity as we propose, then such 
regions should be very common. The commonness of such regions is fully supported by the findings presented 
herein. The next steps will be to determine whether the regions predicted to be disordered do indeed carry 
binding function and to determine whether the predictions are correct. Studies in these directions are 
underway. 

Here we assume that all invisible regions in the X-ray structures are equivalent; however, three different 
causes have been identified for such regions, including crystal packing disorder, static disorder, and dynamic 
disorder [8]. Only the last of these involves the local disorder required by Schulz's proposal, so lumping all 
invisible regions together as we have done may be inappropriate. This is an acknowledged weakness of the 
present study and might be adding noise to the data. We plan to improve our labeled data sets by distin- 
guishing the 3 types of disorder for as many entries as possible, using either literature-based investigations 
or laboratory-based experiments. 

Due to the curse of dimensionality and the small sizes of the DR labeled data sets, our current neural 
network predictors were very simple and limited to just a few features. Yet fairly good predictions are 
evidently being made despite this limitation. Increasing the sizes of DR labeled data sets and repeating the 
feature selection process for each length class will enable us to test a larger pool of candidate features and 
to design more complex predictors. 
The more complex predictors will, hopefully, give more accurate predictions. This is especially important 
for short disordered segments since none of the current predictors do very well in this region. Since SDRs 
are found frequently to be involved in disorder-to-order transitions upon binding [4], improved predictions 
in this domain are certainly important. 

Abstract 

Using ordered and disordered regions identified either by X-ray crystallography or by NMR 
spectroscopy, we trained neural networks to predict order and disorder from amino acid sequence. 
Although the NMR-based predictor initially appeared to be much better than the one based on the 
X-ray data, both predictors yielded similar overall accuracies when tested on each other's training 
sets, and indicated similar regions of disorder upon each sequence. The predictors trained with 
X-ray data showed similar results for a 5-cross validation experiment and for the out-of-sample 
predictions on the NMR characterized data. In contrast, the predictor trained with NMR data 
gave substantially worse accuracies on the out-of-sample X-ray data as compared to the accuracies 
displayed by the 5-cross validation during the network training. Overall, the results from the two 
predictors suggest that disordered regions comprise a sequence-dependent category distinct from 
that of ordered protein structure. 



1 Introduction 

Many regions of proteins and some whole proteins form ensembles of structures under native conditions, 
in essence lacking a fixed tertiary structure within a given functional domain. Such "disordered" 
(or "unfolded") proteins have been identified by several methods: 1. sensitivity to proteases; 2. 
missing electron density in structure determinations by X-ray diffraction; 3. NMR spectroscopy; 
and 4. CD spectroscopy coupled with other methods such as rapid protease digestion, gel exclusion 
chromatography, or survival of function following incubation at high temperatures. 

Disordered regions characterized by the methods described above are often essential for function. 
Such regions have therefore been called 'natively unfolded' [51] or 'natively disordered' [20]. 'Unfolded' 
implies that the region of protein exists in an extended, flexible (random-coil-like) form, whereas 
'disordered' includes not only these extended forms but can also imply a collapsed, partially 
folded form with secondary structure that is non-rigid (molten-globule-like). The disordered ensemble 
of structures can involve equilibria between random-coil-like and molten-globule-like forms. 

Since amino acid sequence determines protein structure [2], we proposed that amino acid sequence 
should determine lack of tertiary structure or disorder as well [43]. To test this hypothesis, we identified 
disordered sequences from missing coordinates in Protein Data Bank (PDB) files and then developed 

2 Materials and Methods 

2.1 The Proteins 

PR01-ACACA, 6. Haloalkane Dehalogenase (HDHase), 2edc, HALO-XANAU, 7. Azurin II (Az II), 
lam, AZU2.ALCXX, and 8. Carboxypeptidase A (CbPA), 2ctb, ELIB-PHYCR. 

2.2 Feature Selection 

We use the term 'attribute' to mean a value calculated over a specified window and the term 'feature' 
for those attributes that are subsequently used to train the neural networks. Sequence attributes are 
numerical values calculated from an amino acid sequence over a specified window [4 - For these studies 
24 attributes provided the initial pool, the first 20 of which are the compositions of the 20 amino acids 
within the specified sequence windows. The last 4 are hydropathy [33], flexibility index [50], helix 
amphipathic moment [21] and sheet amphipathic moment [22]. 
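
As an illustration of a composition attribute calculated over a sequence window, the following minimal Python sketch computes the mole fraction of one residue type in each full window along a sequence. The function name window_mole_fraction is hypothetical, and skipping positions that lack a full window is an assumption of this sketch rather than a stated detail of the method.

```python
from typing import List


def window_mole_fraction(sequence: str, residue: str, window: int) -> List[float]:
    """Mole fraction of `residue` in each full, centered window of odd length."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    values: List[float] = []
    # Only positions with a complete window on both sides are evaluated
    # (an assumption of this sketch).
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        values.append(segment.count(residue) / window)
    return values


# Toy sequence: serine content falls as the window slides to the right.
fractions = window_mole_fraction("SSSAA", "S", window=3)
```

With the window sizes explored in this work (odd values from 9 to 21), the same function would simply be called once per attribute and window size.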

The NMR and X-ray disordered datasets were each matched with an equal number of ordered 
amino acids taken from NRL_3D [40], which is a subset of PDB containing only ordered structures. 
A feed-forward search with minimal error probability selection criterion was used on the balanced 
ordered and disordered NMR and X-ray datasets [43]. A quadratic Gaussian classifier using different 
covariance matrices for each class was used to calculate the minimal error probability during each 
of the searches. Experimentation with other dimensionality reduction methods, such as sequential 
backward search and branch-and-bound, yielded results quite similar to those presented here. Ten 
features were selected from the original pool of 24 attributes. 

2.3 Neural Network Training 

Several possible neural network architectures were investigated in the initial phase of these studies. A 
simple network with 10 inputs, 7 fully connected nodes in a single hidden layer, and one output was 
selected as being commensurate with the dataset size and as giving good results [46]. 

The X-ray and NMR disordered datasets, with their number-balanced datasets of ordered sequences, 
were scrambled in order to separate values from adjacent sequence positions and then divided 
into 5 disjoint subsets by random selection. Experimentation indicated that similar 
accuracies were achieved during training whether or not scrambling was used, but scrambling may 
serve to improve predictions for completely unrelated proteins. 

For each training cycle, 4/5 of the data comprised the training set and 1/5 the test set. The 
training set was further separated into a proper training set (80%) and a validation set (20%). Three 
initializations were used and the number of epochs for each training was chosen as that which produced 
the highest accuracy on the validation set. Once training was investigated by 5-cross validation, the 
data were recombined and training was repeated using 5/5 of the data. 
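
The partitioning scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name five_fold_splits, the fixed random seed, and the use of example indices in place of the actual feature vectors are all hypothetical.

```python
import random
from typing import List, Tuple

Split = Tuple[List[int], List[int], List[int]]  # (proper training, validation, test)


def five_fold_splits(n_examples: int, seed: int = 0, n_folds: int = 5,
                     val_fraction: float = 0.2) -> List[Split]:
    """Scramble example indices, cut them into disjoint folds, and build each cycle."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # scrambling separates adjacent positions
    folds = [indices[k::n_folds] for k in range(n_folds)]  # disjoint subsets
    splits: List[Split] = []
    for k in range(n_folds):
        test = folds[k]  # 1/5 of the data
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]  # 4/5
        n_val = int(round(val_fraction * len(rest)))
        validation, train = rest[:n_val], rest[n_val:]  # 20% / 80% of the 4/5
        splits.append((train, validation, test))
    return splits


splits = five_fold_splits(100)
```

Repeating each of the five cycles from three random initializations of the network weights would give the 15 training runs per data set; the network training itself is outside the scope of this sketch.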

3 Results 

3.1 Selected Features 

The ten features selected on the basis of distinguishing order and disorder for the NMR and X-ray 
datasets are shown in Table 1. Six of the ten features were the same for both datasets: flexibility 
index, hydropathy, and mole fractions of Y, W, C, and S. These data indicate that the NMR- and 
X-ray-characterized regions of disorder share important characteristics. 

With regard to the selected features that were different for the two datasets, the compositions of H, 
D, K, and E distinguished order and disorder for the X-ray dataset, whereas the compositions of F, 
G, R, and P were useful for the NMR dataset. Thus, the features selected for the NMR-characterized 
regions of disorder show important differences from those selected for the X-ray-characterized regions 
of disorder. 

Table 1: Selected features. 

             X-RAY NNP    NMR NNP 
             70%          86% 
             77%          84% 
             77%          89% 
             72%          86% 
             70%          89% 
Average      73% ± 2%     87% ± 4% 

Table 2: Five Cross Validation Results. 

3.2 Five Cross Validation 

The evaluation of the training of the X-RAY NNP was described previously [43]. Here those data are 
compared with the results of a similar training exercise for the NMR NNP (Table 2). Overall, the 
NMR NNP gives a significantly higher accuracy compared to the X-RAY NNP during the training 
exercises, e.g. 87% ± 4% compared to 73% ± 2%. 

3.3 Example Predictions 

Example predictions are shown in Fig. 1. The X axis is the residue number while the Y axis is the 
prediction output. Anything above an output of 0.5 is considered a prediction of disorder. The solid 
horizontal line at the center of the graph indicates what regions are actually disordered. Fig. 1A 
and Fig. 1B are predictions on disordered proteins, while 1C and 1D are predictions upon the ordered 
control proteins. One of the best overall predictions (1A) and one of the worst (1B) on regions of 
disorder as well as the two worst overall predictions (1C, 1D) on the control proteins are provided. In 
(1A), the prion protein from the NMR dataset was subjected to analysis using both the NMR NNP 
and the X-RAY NNP. Notice how the X-ray prediction accuracy is relatively similar to that of the 
NMR predictor, which was trained on this protein's data (X-RAY NNP = 88.4% correct overall; NMR 
NNP = 97% correct overall). In (1B), the antitermination (AT) protein from bacteriophage lambda 
from the NMR dataset was predicted upon by both predictors. Here, the accuracy of the X-RAY 
NNP (53.2%) seems very poor, but notice how its prediction is again somewhat similar to that of the 
NMR NNP, which, despite having this protein in its training set, still manages an accuracy of only 
73%, much lower than that obtained on the prion protein in the previous example. Finally, in (1C) 
and (1D) the predictions on the ordered control proteins, profilin A and haloalkane dehalogenase, are 
presented, respectively. The false positive predictions of disorder are seen to be very short, especially 
for the NMR NNP. 
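
The 0.5 decision threshold and the notion of short predicted disordered stretches can be made concrete with a small Python sketch. The function name disordered_run_lengths and the toy output values are hypothetical; only the rule that outputs above 0.5 count as disorder comes from the text.

```python
from typing import Iterable, List


def disordered_run_lengths(outputs: Iterable[float], threshold: float = 0.5) -> List[int]:
    """Lengths of consecutive runs of residues whose output exceeds the threshold."""
    lengths: List[int] = []
    run = 0
    for value in outputs:
        if value > threshold:  # strictly above 0.5 is a prediction of disorder
            run += 1
        elif run:
            lengths.append(run)  # a run just ended
            run = 0
    if run:
        lengths.append(run)  # close a run that reaches the end of the sequence
    return lengths


# Toy per-residue outputs, not data from this study.
runs = disordered_run_lengths([0.1, 0.6, 0.7, 0.2, 0.9, 0.5])
```

For a fully ordered control protein, the sum of such run lengths divided by the number of residues predicted upon would correspond to a per-residue false positive rate.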

For long disordered regions (LDRs) such as that for the AT protein, even modestly successful 
prediction rates (e.g. just 53.2% for the X-RAY NNP) still give an indication of protein disorder. For 
this reason, we are initially focussing our attention on proteins with such LDRs. 

Fig. 1 also indicates several types of errors. Relative to prediction of disorder, false positive 
predictions are ordered regions incorrectly predicted to be disordered (peak labeled b in 1A) whereas 
false negative predictions are disordered regions predicted to be ordered (region a in 1A, regions c, d, 
and e in 1B and so on for 1C and 1D). Another useful classification is whether an errant prediction is 
false (for example peak a in 1A) throughout (e.g. a non-boundary error, for example peak b in 1A) 
or is correct over some region but then becomes false upon crossing an order/disorder junction in the 
structure (e.g. a boundary error). 

Figure 1: Example Predictions using the X-RAY NNP and NMR NNP. Both predictors were applied to the 
following proteins: murine prion protein (A), bacteriophage lambda antitermination protein (B), 
profilin A (C) and haloalkane dehalogenase (D). The first two contain regions of disorder and the 
last two are ordered, control proteins. The X-axis is the residue number, while the Y axis is the 
prediction output. Values above 0.5 indicate disorder, below 0.5 indicate order. The solid line at 
0.5 indicates an identified region of order, a dashed line a region of disorder. Various types of 
errors are marked and indicated by letters a, b, c, etc. (see text). 



3.4 Prediction Accuracies 

The X-RAY NNP was applied to the NMR-characterized proteins and the NMR NNP was applied 
to the X-ray-characterized proteins; results of these out-of-sample predictions are presented in 
Table 3. For the X-RAY NNP, the overall prediction accuracies range from 53.2% (AT) to 93.5% (HMGI(Y)). 
The large range of error rates undoubtedly relates to a variation in the degree of similarity of the 
disordered regions in the different proteins to the disordered regions used to train the X-RAY NNP. 
For example, unlike most of the NMR proteins, HMGI(Y) has local charge imbalance, thus having 
charge attributes commensurate with those of the X-ray training set and giving a very high overall 
prediction accuracy by the X-RAY NNP. 

For the NMR NNP, the overall prediction accuracies range from 55.1% (tyrosyl-tRNA synthetase) 
to 94.1% (Bcl-xL). Again, the low prediction accuracy on the synthetase signals a difference in the 
characteristics of its disordered regions compared to those in the NMR dataset. The details of this 
difference await further study. On the other hand, the disordered region in Bcl-xL must be more 
similar to those in the NMR training set. It is unlikely to be coincidental that the structure of Bcl-xL 
has also been determined by NMR [37]. 

Structural characterization: A = NMR, B = X-ray diffraction, C = CD, D = Protease hypersensitivity. 

Table 3: Cross Prediction Results. 

Application of the NMR and X-RAY NNPs to the control proteins was carried out. The results on 
a protein-by-protein basis are shown in Table 4. The accuracy ranges from a low of 72.7% (X-RAY NNP, 
Haloalkane Dehalogenase) to a high of 100% (many proteins). 

Finally, the overall prediction accuracies are summarized in Table 5. The NMR and X-RAY NNPs 
give similar overall prediction rates near 74% on each other's training sets. The predictions on the 
control proteins are considerably better, about 84% for the X-RAY NNP and 98% for the 
NMR NNP. The high accuracy on the fully ordered proteins implies that the ordered part of our 
data is providing our predictors information that allows for better generalization than 
that achievable from our disordered data. 



4 Discussion 

4.1 X-ray- and NMR-Characterized Regions of Protein Disorder 



Our pilot studies indicated a definite relationship between amino acid sequence and the presence of ordered 
[illegible] [illegible, 44, 45, 20]. However, these initial studies had two interrelated, acknowledged 
limitations related to their exploratory nature. First, the number of disordered examples was 
[illegible] in the original LDR set. Second, all the disordered examples [illegible] 
considerably with regard to the characterization of disordered proteins. 

[illegible] of disorder seem to be so strong that these two limitations 
[illegible] errors. Work in progress with a [illegible] larger database, now about 
[illegible] those of the pilot studies (manuscript in 
preparation). 

The small number of examples of disordered proteins in our original studies resulted from the 
lack of organized data on non-folding amino acid sequences. PDB is the largest organized source of 
information about proteins with regions of disorder, but as our initial studies clearly demonstrate, 
PDB is strongly biased against the presence of disorder [44], so even this source does not have very 
many examples. 

X-RAY NNP on Control Proteins 

Protein                           Length of  Known     Region     Predicted                     Percent  False     False     Structural        Ref 
                                  sequence   disorder  predicted  DR lengths                    correct  negative  positive  characterization 
Hen egg-white lysozyme            1-129      None      15-115     4                             96%      N/A       4%        B                 53 
Ribonuclease A (RNase A)          1-124      None      15-110     3, 3                          93.8%    N/A       6.2%      B                 10 
β-cryptogein (B-cryp)             1-98       None      15-84      None                                   N/A       0% 
Elastase                          1-240      None      15-226     14, 7, 4                      87.4%    N/A       12.6%     B                 35 
Profilin A (Pfln A)               1-125      None      15-111     12, 11                        75.3%    N/A       24.7%     B                 24 
Haloalkane Dehalogenase (HDHase)  1-310      None      15-297     22, 6, 2, 13, 7               72.7%    N/A       27.3%     B                 27 
Azurin II (Az II)                 1-129      None      15-115     7, 10                         83.2%    N/A       16.8%     B                 18 
Carboxypeptidase A (CbPA)         1-307      None      15-293     3, 9, 6, 9, 3, 2, 6, 21, 16 

NMR NNP on Control Proteins 

Protein                           Length of  Known     Region     Predicted                     Percent  False     False     Structural 
                                  sequence   disorder  predicted  DR lengths                    correct  negative  positive  characterization 
Hen egg-white lysozyme            1-129      None      15-115     None                          100%     N/A       0%        B 
Ribonuclease A (RNase A)          1-124      None      15-110     6                             93.7%    N/A       6.3%      B 
β-cryptogein (B-cryp)             1-98       None      15-84      None                          100%     N/A       0%        B 
Elastase                          1-240      None      15-226     6, 1, 3                       96.4%    N/A       3.6%      B 
Profilin A (Pfln A)               1-125      None      15-111     6                             93.8%    N/A       6.2%      B 
Haloalkane Dehalogenase (HDHase)  1-310      None      15-297     1                             99.6%    N/A       0.4%      B 
Azurin II (Az II)                 1-129      None      15-115     None                          100%     N/A       0%        B 
Carboxypeptidase A (CbPA)         1-307      None      15-293     0                             100%     N/A       0% 

Structural Characterization: A = NMR, B = X-ray diffraction, C = CD, D = Protease hypersensitivity. 

Table 4: Prediction on Controls. 

Set               Total aa      Total ordered    Total disordered  Total false  Total false  Percent   Percent   Percent 
                  predicted on  aa predicted on  aa predicted on   negative aa  positive aa  false     false     overall 
                                                                                             negative  positive  correct 
X-RAY on NMR      884           207              677               179          60           26.4%     28.9%     72.9% 
NMR on X-RAY      1873          1424             449               204          265          59.0%     14.3%     75.0% 
X-RAY on Control  1238          1238             0                 N/A          201          N/A       16.2%     83.8% 
NMR on Control    1238          1238             0                 N/A          29           N/A       2.4%      97.6% 

Table 5: Summary Tables. 

In addition to being few in number, X-ray-characterized regions of structural disorder have alternate 
possible causes for the observed missing electron density, including the following possibilities: 
1. a locally structured domain could be moving; 2. a locally structured domain could be occupying 
several alternative positions; 3. a local region of sequence could comprise an ensemble of interconverting 
shapes; and 4. a local region of sequence could comprise an ensemble of static shapes. From 
our perspective, the important distinction is whether a region of sequence folds into a single structure 
(i.e. either 1 or 2) or comprises an ensemble of structures (i.e. 3 or 4). The distinctions between 1 
and 2 and the distinctions between 3 and 4 are less important; indeed, the 1 versus 2 and the 3 versus 4 
distinctions are a continuum that depends only on the timescale. Here, we refer to locally structured 
domains that are disordered by movement or by occupancy of different positions (i.e. either 1 or 2) 
as 'domain wobble.' 

Others have used 'dynamic,' 'static,' 'hinged' and 'flexible' to describe the various possible causes 
of structural disorder that leads to missing electron density in X-ray crystal structures [7, 25] but 
these previous terms do not correspond in any precise way to the 4 possibilities listed above. It is for 
this reason that we are proposing the terms 'domain wobble' and 'ensembles of structures' to contrast 
the distinctions that we believe are important to this work. In our initial studies we attempted to 
eliminate wobbly-domains by literature studies on each protein. However, not only does PDB lack 
clear information regarding disorder, but attempts to find information from the original literature often 
fail because the needed experiments have simply not been done or have not been reported. We hope 
that, as the importance of disordered protein becomes more generally realized, the key information 
about such disordered regions will become more readily available. 

Given the above, extending the studies of disorder to include regions characterized by NMR is 
important for overcoming both limitations of our initial studies. NMR-characterized disordered regions 
include information about the extent of folding of the disordered region and at the same time increase 
the number of examples. 

Unfortunately, there are relatively few examples of NMR-characterized regions of disorder, and 
these are scattered in the literature and not collected at one location. So, just as for development of a 
disordered regions database using PDB, an intensive effort is required for each new entry characterized 
by NMR. The amount of effort required will continue to slow the rate of enlargement of our database of 
disordered protein. Nevertheless, the amount of disordered data in this paper has more than doubled 
the data compared to that of the pilot studies. 

4.2 Feature Selection 

Our initial work [45, 44, 43] emphasized the use of sequence attributes based on amino acid compo- 
sition. We reasoned that LDRs could be considered to be a new "structural class," and amino acid 
compositions had been shown to be successful for protein class prediction [38]. We are aware that 
consideration of coupling effects among different amino acids has led to much improved prediction 
of protein class [15], and we would like to apply such approaches to disorder prediction. However, 
consideration of amino acid pair frequencies requires much more data than we currently have, so such 
approaches are simply out of reach at the present time. 

The feature selection experiments in the development of the X-RAY and NMR NNPs (Table 1) 
suggest substantial similarities for the disordered regions characterized by these two methods. While 
6 out of 10 of the features are identical, the remaining 4 contribute enough information 
to cause the differences noticed between the predictors. The fact that the NMR NNP has both a 
higher false negative rate and a lower false positive rate than the X-RAY NNP suggests that the NMR 
NNP has a higher threshold for ascribing disorder. This may be because the NMR NNP's training set 
contains more extreme values for the attributes specific to disorder (see Fig. 1), values that are 
found less frequently in the X-ray data set in correlation with disorder/order predictions. 

We have developed a substantially larger database, having approximately 2,500 disordered amino 
acids in windows of 21 matched with an equal number of ordered amino acids in windows of the same 
size. Studies in progress on this larger database indicate that charge imbalance, when it exists, is a 
very strong determinant of local disorder (manuscript in preparation). The selection of H, D, K and E 
for the X-ray dataset is the result of substantial charge imbalance in several of the disordered regions 
in the X-ray-characterized proteins. In contrast, charge imbalance is not so important for the current 
NMR dataset. 

Flexibility index, hydropathy and the mole fraction of S were found to be relatively higher in 
disordered regions as compared to ordered regions for both the NMR and X-ray data in the enlarged 
dataset, which is in complete agreement with the pilot studies [44, 42, 43, 45]. More flexible, more 
polar regions are more likely to be disordered. Not only does S promote disorder by its polarity, but 
it is special owing to its generally high abundance coupled with its ability to stabilize multiple local 
backbone conformations by side-chain-backbone hydrogen bonding [48]. 

On the other hand, Y, W, and C were lower in the disordered regions as compared to the ordered 
regions, both in the new enlarged database and in the data used for pilot studies. In several different 
datasets of ordered and disordered regions, W, Y, and C have always been found to be lower in 
disordered regions: indeed, these three appear to be the most order-forming of the natural amino 
acids (manuscript in preparation). The order-forming tendencies of W and Y may be related to the 
extra stability arising from aromatic/aromatic interactions [12], while the ability to form disulfide 
bonds is an obvious reason for the order-forming potential of C. Interestingly, W, Y, and C also 
evidently have the highest tendency to be conserved as judged, for example, by the various values in 
the PAM 250 matrix [17]. 

With regard to the features selected for the NMR dataset, the studies in progress on the larger 
dataset (manuscript in preparation) suggest that disorder is found to be associated with high mole 
fractions of R and P and with low fractions of F. R is typically an uncommon amino acid, but when 
there are high local concentrations of R, it likely induces disorder by charge imbalance. High levels 
of proline prevent compact folding; indeed, proline-rich regions are common in proteins and seem 
to have functions associated with their ill-folded conformations [52, 3, 39]. Higher local concentrations 
of F probably encourage order for the same reason as Y and W: extra stability from 
aromatic/aromatic interactions [12]. 

In the study on the larger dataset mentioned above, the mole fraction of G when considered alone 
is found to be essentially uncorrelated with either order or disorder, so it is unclear why this amino 
acid was selected for the NMR dataset. On the other hand, in the development of flexibility index, G 
was found to change markedly, showing high flexibility index values when next to flexible neighbors, 
but showing low flexibility values when next to less flexible neighbors [50]. Thus, the selection of G 
may reside in its behavior in conjunction with neighboring amino acids in the ordered and disordered 
sequences. This is consistent with the way feature selection works, trying to select not only the best 
features but also the best combinations of features. The mole fraction of G may not be correlated 
with order/disorder, but its combination with other variables improves the discriminatory power of 
the predictor. 

Since the NMR NNP has a lower rate of false positives than the X-RAY NNP, we plan to use it to 
complement other NNPs by helping weed out false positive predictions. Future predictors based upon 
a larger training set of diverse proteins may yield better results, as characteristics of disorder from 
differing families of proteins are incorporated into the predictors. A new predictor is currently under 
development that combines both of these training sets and selects from a greater number of features 
(unpublished). Preliminary results show greater accuracy as judged by 5-cross validation, suggesting 
that enlarging the data set can lead to greater prediction precision as more features indicative of 
disorder are included. Also, a larger data set can provide our predictors with enough disorder data to 
allow for better generalization. 

4.3 Out-of-Sample Predictions 

The X-RAY NNP and the NMR NNP give similar overall prediction accuracies on the proteins used 
in each other's training sets: 72.9% for the X-RAY NNP on the NMR data and 75% for the NMR 
NNP on the X-ray data. However, significant differences become evident when the types of errors are 
considered. 

The X-RAY NNP on the NMR data exhibits similar false positive (28.9%) and false negative 
(26.4%) rates, whereas the NMR NNP on the X-ray data exhibits large false negative rates (59%) 
and small (14.3%) false positive rates. Additional insight follows from noting that the more or less 
uniform performance of the X-RAY NNP on the NMR data with an overall accuracy of 72.9% closely 
matches the 5-cross validation results (73%), whereas the much more variable performance of the NMR 
NNP is associated with a large drop-off in the out-of-sample predictions (about 75%) as compared to 
the 5-cross validation results (87%). Overall, these data suggest that the NMR NNP is much more 
specific for the disordered regions characterized by NMR, whereas the X-RAY NNP appears to be 
more general. 
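
The error rates quoted above can be recomputed from residue counts with a short Python sketch. It assumes that the false negative rate is taken over the disordered residues and the false positive rate over the ordered residues, which is consistent with the figures quoted for the X-RAY NNP on the NMR data; the function name error_rates is hypothetical, and the counts are those listed in Table 5 for that case.

```python
from typing import Dict, Optional


def error_rates(n_ordered: int, n_disordered: int,
                false_negative: int, false_positive: int) -> Dict[str, Optional[float]]:
    """Per-residue error rates and overall accuracy as fractions."""
    total = n_ordered + n_disordered
    return {
        # Disordered residues predicted to be ordered, over all disordered residues.
        "false_negative": false_negative / n_disordered if n_disordered else None,
        # Ordered residues predicted to be disordered, over all ordered residues.
        "false_positive": false_positive / n_ordered if n_ordered else None,
        "overall_correct": (total - false_negative - false_positive) / total,
    }


# X-RAY NNP applied to the NMR-characterized proteins (counts from Table 5).
xray_on_nmr = error_rates(n_ordered=207, n_disordered=677,
                          false_negative=179, false_positive=60)
```

For the control proteins, which contain no disordered residues, the false negative entry is undefined and is returned as None, matching the N/A entries reported for those sets.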

One possibility for the poor performance of the NMR NNP on the X-ray-characterized disordered 
regions is that these disordered regions are misclassified, e.g. they are actually wobbly domains. 

For example, a region of missing electron density in a tyrosyl t-RNA synthetase different from 
the one in the present studies was later shown to have a considerable part that is ordered [28]. This 
observation on another t-RNA synthetase, coupled with the high false negative error rate (74.6%) for the 
NMR NNP (Table 3), could be an indication that we have misclassified the "disordered" region identified 
by X-ray diffraction in this protein. On the other hand, the NMR NNP's even worse false negative rate 
(e.g., 83% for topoisomerase II) would seem to indicate that this disordered region surely must be 
misclassified. However, the disordered region of this protein has been well-characterized to lack ordered 
structure: the putatively disordered region is extremely rich in charged residues and hypersensitive 
to protease digestion [13, 49]. Indeed, most of the disordered regions of the X-ray characterized data 
exhibit hypersensitivity to protease digestion at multiple sites, which argues strongly against ordered 
structure for these regions. The NMR NNP shows very poor performance on several of the 
X-ray-characterized regions of disorder, with false negative values well over 50%. The overall accuracies 
of the NMR NNP on these proteins give reasonable values, above 50% in every case, because the predictions 
on the ordered parts of these proteins have good accuracies. 

An alternative possibility to explain the very poor predictions of the NMR NNP on the X-ray 
characterized proteins is that their disorder depends on charged residues, which are utilized by the 
X-RAY NNP but not by the current NMR NNP. As mentioned above, the disordered region of 
topoisomerase II is highly charged and therefore consistent with this suggestion. Examination of the other 
disordered regions with high false negative error rates shows that these regions are also highly charged. 

4.4 Control Predictions 

The false positive rate of 16.2% (X-RAY NNP) is much lower than that found when this same 
predictor was evaluated on NRL_3D, which gave a false positive error rate of 31.5%. The reasons for this 
large discrepancy are unclear, but may relate to the selection criteria for the control proteins in this 
study (e.g., small overall size to match the sizes, monomeric, no ligands, except for the metal ion in 
carboxypeptidase). NRL_3D contains large proteins, oligomeric proteins, and proteins with bound 
co-factors. All of these factors could contribute to false positive predictions. For example, a local 
tendency for disorder would be more likely to be over-ridden by non-local interactions in a larger 
protein as compared to a smaller one simply by chance, so larger proteins are more likely to have false 
positive predictions of disorder. We are testing this possibility. Oligomeric proteins might associate via 
disordered regions, in which case a prediction of disorder would appear as a false positive in the crystal 
structure of the oligomer. Finally, binding of co-factors could involve disorder-to-order transitions, so 
again predictions of disorder would appear as false positives in the structure of the holoprotein. 

The NMR NNP exhibited an especially low false positive error rate on the control proteins, just 
2.4%. This is about 6 times smaller than the false positive error rate, 14.3%, on the ordered 
parts of the X-RAY NNP training set. For the NMR NNP on the X-RAY NNP training set most of 
the false positive errors related to improper placement of the order / disorder boundary. Because of 
our windowing procedure, disorder information is carried into the ordered regions with the resulting 
tendency that the NMR NNP predicts disorder farther along the sequence than it actually occurs 
when a region of disorder is present. A further possibility is that in solution the regions of disorder 
actually extend farther along their respective sequences than indicated in the X-ray structures, which 
are more ordered than the solution state due to the ordering effects of crystal formation. 
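
The boundary effect attributed to the windowing procedure can be illustrated with a small sketch. It does not reproduce the predictors' actual input encoding: the window length of 7, the centred moving average and the 0/1 toy signal are all assumptions chosen only to show how a windowed quantity computed for an ordered residue picks up information from nearby disordered residues.

```python
def windowed_average(values, window):
    """Average each position over a centred window, truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


# Toy per-residue signal: 10 ordered residues (0) followed by 10 disordered (1).
signal = [0] * 10 + [1] * 10
smoothed = windowed_average(signal, window=7)

# Ordered residues (indices 0-9) whose windowed value already contains
# a contribution from the disordered segment.
leaked = [i for i in range(10) if smoothed[i] > 0]
```

With a window of 7, the last three ordered residues receive a non-zero windowed value, which is the kind of carry-over described above.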

4.5 Summary 

The NMR NNP and the X-RAY NNP were developed and tested on each other's training set proteins. 
The X-RAY NNP gives similar results across the ordered and disordered regions for both its 
own training set (as measured by 5-cross validation) and the out-of-sample NMR NNP training 
set. In contrast, the NMR predictor does much better on its own training set (as measured by 
5-cross validation) as compared to the out-of-sample X-RAY NNP training set. These data support the 
validity of the X-RAY NNP as a general predictor of protein disorder, suggesting that the uncertain 
interpretation of disorder characterized by X-ray diffraction in principle did not lead to a significant 
problem. On the other hand, the NMR NNP appears to be much more specific, for reasons that are 






Protein Structure Prediction and Design 



Veronica Morea*, Raphael Leplae a and Anna Tramontano 

a IRBM P. Angeletti, via Pontina Km. 30.600, 00040-Pomezia (Rome), Italy 

b Istituto di Chimica del Farmaco, Universita' "G. D'Annunzio", Chieti, Italy 

*To whom correspondence should be addressed 

Abstract 

Proteins have a unique native conformation, which can be proven in many instances to be 
determined by the amino acid sequence alone. The folding problem, that is the 
understanding of how the amino acid sequence directs folding, is still unsolved, despite 
more than 30 years of efforts. However, many new methods have appeared in the past few 
years. This chapter describes the different principles underlying them and tries to give an 
overview of their successes and pitfalls. 
Abbreviations 



1D mono-dimensional 

3D three-dimensional 

CASP Critical Assessment of Structure Prediction 

PDB Protein Data Bank 

r.m.s. root-mean-square 

r.m.s.d. root-mean-square deviation 

URL Uniform Resource Locator 



INTRODUCTION 

The information contained in known protein structures can be of invaluable help both to 
understand the function of individual proteins, for example to explain on a chemical basis 
the catalytic activity of enzymes, and to infer the general principles determining protein 
folding. The knowledge of the 3D structure of proteins is essential to understand their 
functions and/or properties and to be able to modify them in a predictable way. There are a 
number of successful examples where specific properties of proteins have been modified, by 
designing novel proteins, novel ligands or peptidomimetics. Modified proteins have been 
used, for example, to mimic ("agonists") or hamper ("antagonists") the action of a given 
ligand at the receptor level [1-3]. In some cases novel sequences and therefore novel 
structures have been synthesised to achieve specific tasks, or to be used as an appropriate 
scaffold for a given function ("de novo design") [4-14]. Peptidomimetics [15, 16] have been 
shown to be able to reproduce or antagonise the action of a protein by mimicking the 
structural elements involved in the recognition process. Small organic molecules able to 
bind to a target protein and inhibit its activity are often designed on the basis of the 3D co- 
ordinates of the binding site of a given receptor or enzyme [17-19]. 

More than five thousand protein structures are publicly available as of today [20] and, due 
to the continuous progress in X-ray crystallography and NMR spectroscopy, their number is 
increasing more and more rapidly. However, the number of known protein sequences is at 
least one order of magnitude higher and the gap between the number of known structures 
and sequences is continuously increasing, as a consequence of improvements in the 
methods of sequence determination and of the many ongoing genome projects [21] (see 
URL: http://www.sanger.ac.uk/Projects/). Consequently, the prediction of 
protein structure from amino acid sequence represents an appealing prospect, 
which has been intensively pursued in the last three decades. 

Although it is known that the amino acid sequence of a protein (primary structure) 
contains sufficient information to determine its 3D or tertiary structure [22], the specific 
mechanisms underlying protein folding still elude our understanding [23] and a 
multitude of different methods are continuously being developed to try to predict a protein 
structure starting from its sequence. 

Some of these methods generate a high number of possible conformations for a given 
protein sequence and try to select the conformation corresponding to the lowest energy ('ab 
initio'). Other methods, based on the assumption that protein folding is a process under 
kinetic, rather than thermodynamic, control, try to simulate the folding pathways of a 
protein. Both types of methods are very general and, if successful, would allow the prediction of 
the native conformation for any given protein sequence. Unfortunately, they have not been 
very successful up to now because of the complexity of the problem: the number of possible 
conformations of an average protein is extremely high and it is very difficult to obtain an 
accurate representation of all the physical forces acting on proteins. 

Different methods have therefore been developed to assess if a given sequence is likely to 
assume a structure similar to that of an already known protein. These methods can be 
ascribed to one of two large categories: 'modelling by homology' (or 'comparative 
modelling') and 'fold recognition'. They are less general than ab initio methods in that they 
require the unknown protein to be similar in structure to an already known protein and 
need an efficient way to recognise this similarity; however, their results are very satisfactory 
in many cases. 

All these methods aim at providing a model of the 3D structure of the whole protein. When 
this proves unfeasible, however, it is still possible to attempt a partial prediction of the 
protein structure, for example by identifying its secondary structure elements. 

Besides the continuous increase in the power of computing tools and the development of 
ever newer methods for protein structure prediction, two events in the last few years 
determined a considerable step forward in this field: 

• The free diffusion through the Internet of most of the available data on protein 
sequences and structures (see URL: http://www.embl-heidelberg.de/srs/srsc) and of the 
methods for protein structure prediction has given a great advantage to the community of the 
'predictors' and has also allowed non-experts in the field to use the available methods via 
appropriate servers. 

• The two protein structure prediction competitions which were held in 1994 and 
1996 have served as an objective test to evaluate most of the published methods, 
highlighting their strengths and weaknesses, and providing the basis for further 
improvements. 

In this chapter we will summarise the current situation in protein structure prediction and 
some of the implications for protein design. This chapter is by no means intended to provide 
an exhaustive list of the available methods; we will try instead to describe the principles 
underlying them and to highlight their strengths and limitations. We will mostly limit our 
description to prediction methods tested in the two protein structure prediction 
experiments, as rigorous blind testing is the only unbiased way to evaluate their 
performance. 



THE CRITICAL ASSESSMENT OF TECHNIQUES FOR PROTEIN STRUCTURE 
PREDICTION (CASP) EXPERIMENTS 

In 1994 and 1996 two large-scale experiments to critically assess the state of the art in protein 
structure prediction took place (see URL: 
http://PredictionCenter.llnl.gov/). These experiments consisted of two 
phases. First, X-ray crystallographers and NMR spectroscopists were asked to provide 
information about structures which were about to be solved or which had already been 
solved but not yet publicly disclosed. Second, the scientific community was asked to submit 
predictions for one or more of the target proteins. These predictions were subsequently 
compared to the experimental structures. 

The predictions were divided into three categories according to the method used: 

1. Comparative modelling 

2. Fold recognition 

3. Ab initio predictions 

(In the second experiment a 'docking' category was also present but it will not be discussed 
here). 

The predictions were assessed by independent teams, one for each category, and meetings 
were held in December 1994 and 1996 "to examine what went right with the predictions, 
what went wrong, and, where possible, to understand why" [24]. 

There are obvious limitations to the significance of the results: the targets certainly do not 
represent a statistically unbiased sample of all possible protein structures; some groups only 
submitted a small number of predictions, the time allowed for the prediction was limited, 
different groups put different amounts of effort into the experiment and different methods were at 
different stages of development. Nevertheless these experiments still provided an objective 
picture of the capabilities and the deficiencies of most of the existing methods. From this 
picture, it became evident in which fields an improvement was most needed and that 
algorithms claiming predictive capabilities should be asked to demonstrate them through 
blind testing. 

MODELLING BY HOMOLOGY AND LOOP PREDICTION 

I. RATIONAL BASIS OF THE METHOD; RELATIONSHIP BETWEEN SEQUENCE 
IDENTITY AND STRUCTURAL SIMILARITY 

'Modelling by homology' or 'comparative modelling' consists, in very simple terms, of two 
steps: 1) identification of the protein(s) of known structure ('parent') whose sequence is 
most similar to that of the protein to be predicted ('target'); 2) building of a model of the 
structure of the target protein using that of the parent protein(s) as a 'template'. 
The rationale for this procedure is that there is a clear relationship between sequence 
identity and structure similarity in proteins: it has been shown [25] that the similarity of the 
backbone conformation in the core regions of two proteins increases with the sequence 
identity between them. In particular: 

• For proteins with sequence identity ≥ 50% the r.m.s.d. of the backbone atoms of the core 
region¹ is ≤ 1.0 Å and this region comprises about 90% of their structure. 



1 The core region is defined by superimposing the backbone atoms of the secondary structure elements, and 
extending these elements to include additional residues at their ends, as long as the r.m.s.d. is ≤ 3.0 Å [25]. The 
percentage of residues in this region depends upon the structural similarity between two proteins and, 
consequently, upon their sequence similarity. 



• For proteins with sequence identity = 20% the core region could comprise only about half of 
their structure, with an r.m.s.d. of the backbone atoms in this region > 1.8 Å; relevant 
structural differences can occur outside of the core. 
• Proteins with sequence identity between 20% and 50% have an intermediate degree of 
similarity between those described. 

According to these observations, a known protein structure will be a good 'template' for the 
target protein if the sequence identity between them is ≥ 50%, while it will be 
difficult to build a reliable model when the sequence identity is lower than 20-30% [26]. It 
should be mentioned, however, that in some cases even an approximate model based on a 
sequence identity lower than 20-30% can be useful for many practical applications as long 
as additional information is available [27]. 
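
The r.m.s.d. values quoted above can be made concrete with a minimal sketch. It assumes that the two sets of backbone atoms have already been put into one-to-one correspondence and optimally superimposed; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    points, assumed to be already paired and optimally superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom of the second set is displaced by 1 unit along z.
value = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```

Because the deviation is averaged over squared distances, a few badly placed atoms weigh on the result more than many small displacements.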

II. SOURCE OF DATA: PROTEIN SEQUENCE AND STRUCTURE DATABASES 

The information required for model building by homology is of two types: 

• 1D information: nucleotide and amino acid sequences; 

• 3D information: protein structures. 

These data are stored in databases maintained by groups responsible for collecting, 
checking, formatting and updating them. [...] applications or derived from other databases 
(for example, most of the sequences are obtained through translation of DNA sequences rather 
than from direct protein sequencing). Most of the databases are cross-referenced to several other 
databases and appropriate tools are usually provided to retrieve and, in some cases, analyse the data. 
One of the most important features of protein sequence and structure databases is the 
possibility to access them directly via the Internet (see URL: 
http://www.embl-heidelberg.de/srs/srsc). Given the growing number of databases and their 
continuous improvement, the only reliable source of information about them is the Internet. 
Here we will just list the most important as a reference point for the reader: 

1. Nucleotide (DNA and RNA) sequences are collected by GenBank (NIH, USA) [28], by 
   the EMBL Data Library or Nucleotide Sequence Data Bank (EMBL, Heidelberg, 
   Germany) [29] and by DDBJ, the DNA Database of Japan [30]. 

2. Amino acid sequences are collected by the group at the National Biomedical Research 
   Foundation (Washington DC, USA), who also developed an information retrieval 
   system called PIR (Protein Identification Resource) [31], by the Martinsried Institute for 
   Protein Sequences, Max Planck Institute for Biochemistry (München, Germany), by the 
   Japan International Protein Information Database JIPID (Noda, Japan) and by Swissprot at 
   the EMBL. 

3. Structure databases are the Protein Data Bank or PDB [20] at the 
   Brookhaven National Laboratory (New York, USA), which contains structures of 
   biological macromolecules (proteins, nucleic acids and carbohydrates), and the 
   Crystallographic Data Centre (Cambridge, UK) [33, 34], devoted to the structures of 
   small molecules that can be components or ligands of biological macromolecules. 

Besides these principal databases there are several derived ones, including those listed below. 

PROSITE contains protein sequence patterns (common, for example, to a protein family) or 
sites (diagnostic of a protein function) [35]. 



BLOCKS contains aligned 'ungapped' (see below) segments of protein sequences 
corresponding to their most highly conserved regions [36]. 

DSSP (database of secondary structure assignments) contains information about the 
secondary structure assignments for each entry in PDB [37]. 

HSSP (homology-derived structures of proteins) merges sequence and structure information 
by providing alignments of the sequence of each protein structure in PDB with all its 
sequence homologues [38]. 

FSSP (families of structurally similar proteins) contains structural alignments of proteins in 
PDB [39]. 

III. METHODOLOGY: MODEL BUILDING 

The essential steps of model building by homology are: 

1) Identification of the protein(s) of known structure with the highest sequence identity or 
similarity with the target sequence; optimal alignment between the target and template 
sequences and modelling of the main-chain of the core 

2) Loop prediction 

3) Side-chain modelling 

After optimal alignment of two sequences, one can measure their sequence identity (by 
simply counting the number of identical residues found in corresponding positions) or their 
sequence similarity (by adding the 'similarity' score between each pair of aligned residues). 
The problem of finding the alignment of two strings of characters that maximises sequence 
identity or similarity can be formulated in precise mathematical terms and algorithms able 
to solve this problem have long been known [40, 41]. However, this optimal sequence 
alignment does not necessarily correspond to the optimal superposition between the two 
protein structures. This is mainly due to the presence of amino acid insertions and deletions 
between two homologous proteins and to the relatively arbitrary choice of the similarity 
score. 
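
The two measures defined above can be sketched as follows, for a pair of sequences that have already been aligned. The scoring function is a hypothetical toy scheme, not one of the published similarity matrices, and percent identity is computed over the aligned (non-gap) columns, which is only one of several conventions in use.

```python
def identity_and_similarity(aligned_a, aligned_b, score):
    """Return (percent_identity, similarity) for two aligned sequences.

    aligned_a and aligned_b are equal-length strings in which '-' marks a gap;
    score(x, y) is any residue-pair similarity function.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    aligned_columns = 0
    similarity = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns contribute to neither measure in this sketch
        aligned_columns += 1
        if x == y:
            identical += 1
        similarity += score(x, y)
    percent_identity = 100.0 * identical / aligned_columns if aligned_columns else 0.0
    return percent_identity, similarity


ACIDIC = {"D", "E"}  # toy residue class, for illustration only


def toy_score(x, y):
    """Hypothetical scheme: 2 for identity, 1 for Asp/Glu exchange, -1 otherwise."""
    if x == y:
        return 2
    if x in ACIDIC and y in ACIDIC:
        return 1
    return -1


result = identity_and_similarity("ACD-EF", "ACEGEF", toy_score)
```

In this example four of the five aligned columns are identical and the conservative Asp/Glu exchange still adds to the similarity score.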

The probability that insertions and deletions occur among related proteins is not high and, 
above all, it is not the same in all positions. For example, insertions and deletions are much 
more frequent at the protein surface, where they only determine local variations of the 
structure, than in the core of the protein or within secondary structure elements, since in 
these regions they most likely affect the protein structure and/or function. As in a protein 
structure there is a limited number of positions in which it is possible to insert or delete 
residues without altering the protein function, the insertion or deletion of a contiguous 
segment in one of these 'neutral' positions is more likely than the 
insertion or deletion of the same number of residues in different positions. 
The probability for a mutation to occur depends upon the similarity between the exchanged 
residues. This similarity can be evaluated on the basis of specific criteria: for example, 
conservative mutations, that is mutations between residues with similar features (chemical- 
physical properties, dimension, etc.), can be more easily accepted in a protein structure and 
mutations between residues coded by nucleotide triplets differing by a single base are more 
likely to occur. 

Methods for protein sequence alignment [40, 42] take into account the above factors by 
assigning a penalty for insertions and deletions which is higher at the beginning of the 
insertion and lower for subsequent residues, and often use scoring matrices derived from a 
statistical analysis of patterns of mutations in protein structures which assign a specific cost 
for each residue mutation [43, 41, 44, 45]. These algorithms are able, given a penalty for 
insertions and deletions and a similarity matrix, to give the global optimal alignment 
between two protein sequences and to measure the identity or similarity between them. 
They are also extended to provide multiple sequence alignment between members of protein 
families and can also be tailored to search sequence databases for proteins similar to a given 
sequence [42, 46-48]. In this case, the algorithms have to compare a very high number of 
sequences and some of them [46] use approximate alignment algorithms to pre-screen the 
databases. 
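
A gap penalty that is higher for the first residue of an insertion than for subsequent ones can be handled by dynamic programming with three score tables. The sketch below is a generic affine-gap global alignment score and is not claimed to be the specific algorithm of the cited references; the penalty values and the scoring function are hypothetical.

```python
NEG = float("-inf")


def affine_global_score(a, b, score, gap_open=4.0, gap_extend=1.0):
    """Optimal global alignment score; a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    # Y: b[j-1] aligned to a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


def match_score(x, y):
    """Hypothetical scheme: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


best = affine_global_score("ACGGT", "ACT", match_score)
```

With these toy parameters the best alignment of ACGGT with ACT keeps the three matches and pays for a single gap of two residues (4 + 1), which is cheaper than two separate gaps (4 + 4).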

In some cases the output is a list of sequence alignments between the sequence of the 
protein used for the search and similar sequences found in the database [46, 48]; in others it 
is a list of alignments between segments of the input sequence and those of database sequences 
which are not interrupted by insertions and deletions ('ungapped' alignments) [47]. The 
latter can be useful to detect correlations between proteins which have a relatively high 
local sequence identity but a poor global similarity. 

Using database search methods it is possible to select which proteins of known structure are 
most similar to our target sequence. 

When the sequence identity is greater than 50%, it is generally possible to obtain a reliable 
alignment using any alignment method. When the identity drops to less than 40%, 
automatically generated alignments usually contain errors, which can often be corrected 
manually on the basis of different criteria. It is usually advisable to build a multiple 
sequence alignment of as many proteins of the same family as possible, because this can 
help in assessing the correctness of the alignment. For example, sometimes secondary 
structure information on at least one of the proteins of the alignment is available, either 
from X-ray or NMR structure determination, or can be obtained using secondary structure 
prediction methods. In this case it is possible to verify whether insertions and deletions fall 
outside of secondary structure elements and, if not, to modify the alignment appropriately. 
Many protein sequences contain specific patterns of residues which are characteristic of the 
family they belong to; the residues belonging to these patterns, as well as those involved in 
protein function (e.g., catalytic residues of enzymes) should be correctly aligned. The 
multiple sequence alignment of the target with similar sequences will show conserved and 
variable regions within the family and this can help in aligning distant homologues. 
Literature and experimental data should also be used to check and refine the alignment. 
It is important to highlight that the correct alignment of the 'target' and 'template' 
sequences is the fundamental step in any homology modelling procedure: errors at this level 
are the main cause of errors in the final model (URL: 
http://PredictionCenter.llnl.gov/). 
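
The check described above, whether insertions and deletions fall outside secondary structure elements, can be sketched as follows. The one-letter encoding of secondary structure ('H' for helix, 'E' for strand, any other character for loop) and the function name are assumptions made for illustration; an insertion is flagged only when the template residues on both sides of it belong to the same type of element.

```python
def gaps_inside_secondary_structure(aligned_template, aligned_target, template_ss):
    """Return the alignment columns where an insertion or deletion falls
    inside a helix ('H') or strand ('E') of the template.

    template_ss has one character per template residue (no gap characters).
    """
    flagged = []
    pos = 0  # index of the next template residue
    for col, (t, q) in enumerate(zip(aligned_template, aligned_target)):
        if t == "-":
            # Insertion in the target: look at the flanking template residues.
            before = template_ss[pos - 1] if pos > 0 else "-"
            after = template_ss[pos] if pos < len(template_ss) else "-"
            if before in "HE" and before == after:
                flagged.append(col)
        else:
            if q == "-" and template_ss[pos] in "HE":
                flagged.append(col)  # deletion of a residue inside an element
            pos += 1
    return flagged


# Toy example: one insertion inside a helix, one deletion in a loop.
suspicious = gaps_inside_secondary_structure(
    "ACD-EFGHK",   # template, aligned
    "ACDWEF-HK",   # target, aligned
    "CHHHHCCC",    # template secondary structure, one character per residue
)
```

Columns returned by such a check are candidates for manual correction of the alignment, for example by shifting the gap to the nearest loop.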

The protein with the highest sequence identity with the target is used as a template for 
modelling the main-chain of the secondary structure elements of the target. If different 
regions of the target sequence are most similar to different proteins these can be selected as 
a template for the corresponding regions [24, 49]. 

Loops are regions connecting secondary structure elements of a protein and are usually 
located at its surface. Information about loop structures is often important in that, in many 
cases, loops have an important functional role: thanks to their surface location, loops are 
often involved in interactions with other proteins or in the catalytic mechanism of enzymes 
and, in some cases, they constitute the nucleation site for protein folding [50]. 



The prediction of loop structures is a particularly difficult task since they are much less 
regular and much more variable than α-helices and β-sheets; moreover, insertions and 
deletions are most likely to occur in loop regions, therefore their structure is often quite 
different even among closely related proteins. A satisfactory general method for loop 
prediction has not yet been developed but, in a few cases, the structural analysis of known 
proteins has allowed heuristic rules to be identified and methods for loop prediction to be 
developed. Other methods have been reported to be able to predict loop conformations in the context 
of a correct structure [51], but none of them has been successful in either of the CASP 
experiments, possibly because of errors in the rest of the structure [26, 24, 49] (see URL: 
http://PredictionCenter.llnl.gov/). 

In some cases, loop conformations can be inherited from the template structure if their 
length and sequence patterns are conserved. Alternatively, rules based on known 
sequence-structure relationships, database searching techniques or ab initio calculations are 
used. 

The conformation of short turns (up to 4 residues long), which allow the peptide chain to 
change direction by 180° [52], is determined by the presence of special residues like Gly and 
Pro in specific positions of the loop and can therefore be predicted on the basis of the loop 
sequence [52-60]. 

In at least one protein family, immunoglobulins, functionally important loops can be 
predicted quite accurately: the identification of a limited number of 'canonical structures' 
for five of the six immunoglobulin hypervariable loops (L1, L2, L3, H1, H2) [61-64] and the 
recognition of the residues responsible for each of these structures allow their 
conformation to be predicted with an accuracy within 0.2-1.0 Å [65]. 

The predictive ability of this method has been validated through rigorous 'blind testing' 
[63]. Recently, recurrent conformations have also been described for the sixth loop (H3) [66- 
68] and a prediction method for this loop has been developed [68] (Fig. 1). It is now possible 
to accurately predict the conformation of the 10 residues close to the framework of H3 loops 
of any length, the overall conformation of H3 loops up to 12 residues in length and, in some 
cases, the overall conformation of longer H3 loops (Morea, V., Tramontano, A., Rustici, M., 
Chothia, C. and Lesk, A.M., Conformation of the third hypervariable region in the VH 
domain of immunoglobulins, submitted). 

One of the most commonly used procedures for loop prediction consists of searching the 
database of known protein structures for regions with a similar conformation to that of the 
regions adjacent to the target loop and separated by the same number of residues as those 
of the loop. This procedure is based upon the hypotheses that i) a loop with the same 
conformation as the target loop is present in the database and that ii) there is a relationship 
between the conformation of the adjacent regions and that of the loop, that is adjacent 
regions with a similar conformation are connected by loops with a similar conformation. 
However, it has been demonstrated [69] that the structural similarity between regions 
preceding and following loops of the same length is neither a sufficient nor a necessary 
condition for the structural similarity of the loops themselves: similar adjacent regions can 
be connected by loops with either similar or different structure and structurally similar 
loops can have similar or completely different adjacent regions. Therefore, while it is 
possible to identify loops with a similar structure to the target loop using database search 
techniques, it is not possible to distinguish a priori a correct result from a wrong result, and 
this limits considerably the usefulness of the method. 

Ab initio methods for loop prediction [70-75] do not use the database of protein structures. 
They generate different putative loop conformations and evaluate them on the basis of 
empirical energy functions, often taking into account the interactions between the modelled 
loop and the core of the protein structure. However, the loop to be predicted can interact 
with other loops that have to be modelled, so that the complete panoply of interactions 
cannot be taken into account. The methods used to generate and evaluate the different 
conformations will be described in the ab initio protein structure prediction section. 
These methods have two major limitations. 

The force fields (see below) are not sufficiently exact or complete to evaluate 
correctly the energy of the different conformations. Also, the model of the regions in which 
the loop is inserted is affected by an error which could strongly influence the results. 

The evaluation of different conformations of short protein segments outside 
their structural context [70-75] does not take into account tertiary interactions, that is 
interactions with residues outside the loop, which have been demonstrated to be important 
in many cases [76]. Moreover, it has been shown that segments with the 
same sequence assume different conformations in different proteins [77]. Consequently, 
methods which do not take into account the specific environment of the loops will be able 
to predict only those loops whose conformation is determined locally and 
will give incorrect results in the other cases. On the other hand, loops whose 
conformation is determined locally can often be predicted from sequence only, in a simpler 
way than using complex ab initio calculations. 

In order to produce a complete model, the main-chain conformation of the core and those 
of the loops, however obtained, are merged together. However, as methods to 
predict loop conformations are able to identify the correct conformation just in a limited 
number of cases [78, 79], it is advisable to critically evaluate whether it is necessary 
to model them: while some loops are critical for protein function, others are far from the 
regions of main interest of the model (e.g., binding or catalytic sites) and their prediction 
may not be necessary. 

The next step of a homology modelling experiment is the assignment of the side-chain 
conformations. Each side-chain preferentially assumes a limited number of conformations [80], 
which are usually collected in a so-called 'rotamer library'. 

*rL can differ in the grouping of amino acid used to calculate the 
rotamer distribution, for example by taking into account the local «™^tofa rescue 
or its backbone angles [81, 82]. In some cases, these libraries, comb ™^ 
to exclude rotamers producing unfavourable steric interactions, are used to build dT the 
Me-chains of the model. However, as the target and template proteins are^ be 
related, the conformation of the side-chains of the conserved residues of the target an be 
modelled on that of the corresponding residues of template; the non conserved residues of 
le target can also be modelled by importing the conformational angles of die template up 
to where the relative length of the two side-chains permits, using rotamer libraries for the 
remaining part of the chain. 
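
As an illustration of how a rotamer library can be combined with a steric check, the following minimal Python sketch picks, for a single residue, the most probable rotamer that does not clash. The rotamer angles, the probabilities and the `clashes` callback are hypothetical placeholders introduced here for illustration; they are not taken from any published library or program.

```python
def choose_rotamer(rotamers, clashes):
    """Return the chi1 angle of the most probable clash-free rotamer, or None.

    rotamers: list of (chi1_degrees, probability) pairs (hypothetical values).
    clashes:  function taking a chi1 angle and returning True if that
              conformation produces an unfavourable steric interaction.
    """
    for chi1, _prob in sorted(rotamers, key=lambda r: r[1], reverse=True):
        if not clashes(chi1):
            return chi1
    return None


# Hypothetical three-rotamer library for a single residue type.
library = [(-60.0, 0.5), (180.0, 0.3), (60.0, 0.2)]

# Hypothetical steric check: pretend the -60 degree rotamer bumps a neighbour.
best = choose_rotamer(library, clashes=lambda chi1: chi1 == -60.0)
```

In a real application the steric check would be evaluated against the atoms of the model already built, which is not attempted in this sketch.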



As for loop modelling, energy-based procedures are also used. Usually these procedures
start their refinement from a model having the most common rotamers at every position
[82].

While methods for modelling side-chain conformation seem to perform rather well when 
given experimental co-ordinates for the backbone atoms [49], their accuracy is much lower 
for protein models and decreases rapidly as the r.m.s.d. between the model and the real 
structure increases [26] (see URL: http://PredictionCenter.llnl.gov/). This might
indicate that an improvement in this area could be automatically achieved as a consequence 
of improvements in backbone modelling [26]. 

IV. MODEL REFINEMENT 

After a complete model has been built, it has to be inspected, both visually and through
the use of specific programs, to evaluate and optimise it. Unfavourable steric interactions
have to be relieved by changing the side-chain conformations through a few cycles of energy
minimisation or geometric refinement. Both techniques can also be used to optimise those
main-chain regions that, because of the insertion of loops, result from joining fragments
coming from different proteins. However, it should be emphasised that neither method can
substantially modify the starting model and consequently correct large mistakes [23].
Many attempts have been made to obtain a global refinement of a protein model using
energy minimisation techniques or molecular dynamics (see: ab initio methods), but it has not
yet been proven that these methods can improve the quality of the model. Energy
minimisation algorithms will only find the local minimum closest to the starting
conformation. Energy minimisation of protein crystal structures usually leads to a local
minimum with an r.m.s.d. of about 1.0 Å from the native structure, which is comparable to
the expected error for a model built from a template protein with sequence identity ~50%
[25]. In blind tests, energy minimisation and molecular dynamics did not improve the
quality of the models, and often models built without any further refinement were closer to
the real structures than those 'optimised' using various combinations of these methods [26]
(see URL: http://PredictionCenter.llnl.gov/).
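
The observation that minimisation only reaches the local minimum nearest to the starting conformation can be illustrated on a toy one-dimensional energy surface. The double-well function, the step size and the starting points in the sketch below are assumptions chosen purely for illustration and bear no relation to a real force field.

```python
def energy(x):
    """Toy double-well 'energy surface' (hypothetical, not a force field)."""
    return (x ** 2 - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytical derivative of the toy energy."""
    return 4.0 * x * (x ** 2 - 1.0) + 0.3


def minimise(x0, step=0.01, n_steps=5000):
    """Plain steepest descent: always moves downhill from the start point."""
    x = x0
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x


# Two starting points on either side of the barrier end in different wells.
left = minimise(-0.5)   # reaches the deeper well at negative x
right = minimise(0.5)   # trapped in the shallower well at positive x
```

The run started at 0.5 never crosses the barrier, so it ends in the well of higher energy even though a deeper one exists.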
V. EXPECTED ACCURACY OF THE MODEL 

The overall quality of models is highly dependent on the quality of the sequence alignment 
and on the degree of similarity of the target with the parent structure. Both factors are 
related to the degree of sequence similarity and to the number of insertions and deletions 
between target and parent [26]: a model built for a target with a medium to high sequence 
identity (> 40 %) and without insertions or deletions with the template is generally highly 
accurate [26] and can be almost as accurate as crystal structures when sequence identity is 
high (~85%) (see URL: http://PredictionCenter.llnl.gov/).
An upper threshold for the accuracy of homology models can be established based on the
differences between independent structure determinations of the same protein, which have
r.m.s.d. values of around 0.25-0.40 Å [25]; a model cannot be expected to be better than this.

The correctness of the alignment is the main factor influencing the r.m.s.d. between the
model and the real structure: even small errors in the alignment give rise to high r.m.s.d.
values, while a correct alignment allows very good models to be produced, at least for the
core regions, even in predictions based on distantly related parent structures (see URL:
http://PredictionCenter.llnl.gov/). This emphasises the need for careful analysis
and manual editing of the alignment for pairs of sequences with < 40% identity, since
automated methods do not provide good alignments in this range.
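
Since the r.m.s.d. is the measure used throughout this discussion, a minimal sketch of its calculation may be useful. It assumes that the two sets of co-ordinates contain equivalent atoms in the same order and have already been superposed; the optimal superposition step itself is not shown, and the co-ordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z).

    Assumes the structures are already superposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("co-ordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented example: every atom of the 'model' is displaced by 1 unit along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
real = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(model, real)
```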

Given a correct alignment, the quality of the prediction of the main-chain in the core region 
of the target protein can be evaluated on the basis of the relationships between sequence 
identity and structural similarity previously described [25]. 

It is worth mentioning that some completely automated methods tested in the CASP 
experiments [83, 84] proved to be able to build correct models when sequence identity with 
the parent is very high (85%); however, for more distantly related proteins human 
intervention, first of all in the correction of the sequence alignment, is still required to obtain 
reasonably accurate models [26]. 

The quality of the prediction should be evaluated using our knowledge of protein 
structures: 

• The determination, through appropriate programs [85], of the solvent accessible surface
of each atom or residue makes it possible to distinguish between buried and exposed residues and
to assess whether the partition of hydrophobic and hydrophilic residues between the
surface and the core is comparable with what is observed in real protein structures.
• The determination of atomic volumes makes it possible to evaluate the packing and to identify
cavities larger than those usually present in protein interiors [86].
• Unpaired hydrogen bond donors or acceptors should not be present in solvent
inaccessible regions of the proteins [26].
• The stereochemical quality of the model can be evaluated on the basis of standard
parameters derived from the statistical analysis of known protein structures [87].
One conclusion derived from the comparison of different predicted structures is that the
best models are those which deviate least from the parent structure (see URL:
http://PredictionCenter.llnl.gov/). In other words, any attempt to model ex
novo regions of the protein, or more sophisticated approaches which inherit their structures
less directly from the parents, seem to perform less well: this implies that modelling
techniques are still not able to add features to the models. Open problems are the modelling
of main-chain segments whose conformation differs from that of the parent structure or
that are shifted as rigid bodies with respect to the parent [49], the modelling of loops other
than those predictable from sequence, and the modelling of side-chains when the backbone
conformation of the parent deviates significantly from that of the target.
It is important to note, though, that the most conserved regions in proteins are those that 
have an important structural and/or functional role and these regions are often those
modelled with higher accuracy [23]; therefore, in spite of their possible shortcomings, 
models built by homology generally contain a wealth of practically useful information and 
are often instrumental in interpreting experimental data, in planning new experiments and 
in guiding the design of modified proteins [49]. 
FOLD RECOGNITION METHODS 

Fold recognition techniques try to identify known protein structures which are compatible 
with a target sequence, even if the template and the target share no detectable sequence 
similarity. The rationale for these methods is the following:

1. The relationship between protein sequence identity and structure similarity [25] is not 
biunivocal: proteins with high sequence identity invariably have similar structures but 
proteins with similar folds can arise from both similar and completely different 
sequences (and functions). 



2. The majority of known protein structures can be grouped into a limited number of 
structural classes [88]; it is therefore likely that the number of possible protein folds is 
limited [89]. 

As a consequence of these observations, there is a reasonably high probability that the 
protein structure database contains structures similar to that of a target protein, even if 
sequence search methods are unable to detect the similarity. This probability will grow as
new protein structures with novel folds are determined. However, if the target
sequence shows no significant sequence identity with proteins of known structure, new 
criteria to identify the related known protein structure have to be developed. 
Because of the reasonable success of fold recognition techniques in CASP1 [90], they have
been the field of protein structure prediction which has expanded most rapidly: in CASP1, 8
groups participated in this section, while in CASP2 there were more than 34 groups [91].
In fold recognition, too, the use of evolutionary information can improve the results [92]
(see URL: http://PredictionCenter.llnl.gov/); for example, a prediction on a
given target protein is more likely to be correct if the same prediction is obtained on a
distantly related protein.

I. Methods

Several different strategies have been elaborated and used to recognise sequence to 
structure compatibility. These strategies can be ascribed to one of the following categories: 
profile-based methods, threading and mapping methods.

Profile-based methods [93, 94] rely on the observation that each amino acid residue shows
preferences for specific structural environments and that consequently some residue types
are more likely to be found in a given position in a protein structure than others.
From a statistical analysis of the database of known protein structures, it is possible to
classify each amino acid into classes; for example, a given amino acid type can be more often
found in buried regions of α-helices, or in exposed loops.

Given a target protein sequence, each amino acid can be substituted by a symbol
representing the class it belongs to. Conversely, each position in a protein structure can be
represented by a symbol describing its environment (exposed or buried, α-helix or β-strand).
These two mono-dimensional strings can then be aligned by applying the same algorithms
used for sequence alignment [93, 95] and the quality of the alignment between the structure
and the target sequence can be evaluated. The alignment score will be related to the
probability that each residue of the target sequence will be found in the environment of the
corresponding structural position and will represent the overall compatibility between the
target sequence and the 3D structure. Variations of this method combine information about
environmental preferences with sequence substitution matrices.
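
The alignment of the two mono-dimensional strings described above can be sketched with a standard global dynamic-programming alignment. The two residue classes, the two environment symbols, the score table and the gap penalty used here are hypothetical simplifications introduced for illustration only; published profile methods use many more classes and statistically derived scores.

```python
HYDROPHOBIC = set("LIVMFWC")  # assumed class definition, for illustration only

# Hypothetical compatibility scores between residue class and environment:
# H = hydrophobic, P = polar; B = buried, E = exposed.
SCORE = {("H", "B"): 1.0, ("P", "E"): 1.0, ("H", "E"): -1.0, ("P", "B"): -1.0}


def residue_classes(sequence):
    """Replace each amino acid by the symbol of its (assumed) class."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence)


def compatibility(sequence, environments, gap=-1.0):
    """Global alignment score of a sequence against an environment string."""
    classes = residue_classes(sequence)
    n, m = len(classes), len(environments)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + SCORE[(classes[i - 1], environments[j - 1])]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# The same sequence scored against a compatible and an incompatible pattern.
good = compatibility("LKVE", "BEBE")
poor = compatibility("LKVE", "EBEB")
```

The sequence scores higher against the structure whose buried positions coincide with its hydrophobic residues, which is the sense in which the score represents sequence-to-structure compatibility.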

In threading methods [93, 96, 97, 95, 98-100], the target sequence is inscribed in all possible 
frames into a subset of the known protein structures, selected to be as representative as 
possible of the different types of existing folds. The different alignments between the target 
sequence and each of these structures are evaluated by using an energy function. The
assumption underlying these methods is that the native protein structure corresponds to the 
lowest energy conformation among those accessible to the protein chain at equilibrium; 
consequently, alignments with low values of the energy function should be indicative of the 
compatibility between the target sequence and the 3D structure. 

The critical components of these methods are considered to be the energy functions
describing protein-solvent systems, the techniques used to perform sequence-to-structure
alignments and the criteria chosen to identify known structures which are similar to the
[...] have been developed, using different approaches to [...]

[...] have been used up to now to develop energy functions able to describe
molecular systems [101]: 'deductive' approaches start from a priori chemical-physical
principles and use quantum-mechanical calculations to generate semi-empirical energy
functions like those used by ab initio protein structure prediction methods (see below);
'inductive' approaches start from the experimental data, that is from known protein
structures, and use statistical analysis to generate knowledge-based energy functions. The
latter, because of their relative simplicity, are the most commonly used in threading [...]



In knowledge-based energy functions, the frequencies of observed events (for example, of
contacts between two amino acid types) are extracted from the database of known protein
structures and transformed into energy terms by applying the inverse Boltzmann equation [...]

The formulation of the potential energy as a function of inter-residue contacts is based on
the assumption that in protein structures pair-wise inter-residue contacts between non-[...]
amino acids are determined by the interaction energy [...]. In
other words, two residues which are often found close to each other are likely to establish
attractive interactions. Of course, this assumption is not, or not completely, true. In fact, in
general, any two residues also establish interactions with several other residues which
might be determinant for their relative position. For this reason, more complex potentials
[...] to take into account the contacts among triplets and 4-tuplets [...]
residues have been developed and reported to improve recognition of native folds (see URL:
http://PredictionCenter.llnl.gov/).
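
The conversion of observed contact frequencies into energy terms can be sketched as follows, using the inverse Boltzmann relation in its usual form, E = -kT ln(f_obs / f_exp). The contact counts, the choice of the expected (reference) frequencies and the value kT = 1 are all invented for illustration; the definition of the reference state in particular differs between published potentials.

```python
import math


def contact_energies(observed_counts, expected_counts, kT=1.0):
    """Turn contact counts into pseudo-energies via the inverse Boltzmann relation.

    Both arguments map a residue-type pair to a positive count; the expected
    counts stand for an assumed reference state with no specific interactions.
    """
    total_obs = sum(observed_counts.values())
    total_exp = sum(expected_counts.values())
    energies = {}
    for pair, n_obs in observed_counts.items():
        f_obs = n_obs / total_obs
        f_exp = expected_counts[pair] / total_exp
        energies[pair] = -kT * math.log(f_obs / f_exp)
    return energies


# Invented counts: Leu-Leu contacts are seen more often than expected,
# Leu-Lys contacts less often, Lys-Lys contacts exactly as expected.
observed = {("L", "L"): 30, ("L", "K"): 10, ("K", "K"): 10}
expected = {("L", "L"): 20, ("L", "K"): 20, ("K", "K"): 10}
energies = contact_energies(observed, expected)
```

Pairs found in contact more often than in the reference state receive a negative (favourable) energy, and pairs found less often a positive one.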

Several other different formulations of the potential energy [...]
both more complex and simpler than the first pair-wise residue potentials [102]. More
complex potentials include additional terms beyond those accounting for [...]
interactions, for example solvent accessibility and backbone conformation terms [102].
However, it has been shown that there is no substantial improvement in using all these
terms together: even a very simple potential considering only contacts between buried
hydrophobic residues (Leu, Ile, Cys, Met, Phe, Trp, Val) proved to be reasonably successful [...]

Different potentials also differ in the representations used for amino acid residues [103]: each
residue is often represented by a single atom, for example Cα or Cβ, or by average side-chain
centroids; in some cases backbone and side-chain groups are distinguished.
Up to now it has been quite difficult to compare the [...]
potentials [90, 91]. A generally used test is the so-called [...] in
which the native fold of a protein sequence has to be recognised by threading it
into a library of known folds, with no gaps in the sequence and in the fold allowed in the
threading process ('ungapped-threading'). This is a necessary but not sufficient test, in that
it is too easy (the native structure is much more favoured with respect to the other
alternatives and even simple patterns of hydrophobic and hydrophilic residues have been
shown to be able to identify it) and its success does not guarantee the recognition of similar
[...] when the native fold is not present in the database [102]. New tests are therefore
being developed to evaluate the ability of the potentials to discriminate between real and
purposely built decoy structures [91] (see URL:
http://PredictionCenter.llnl.gov/) or by requiring the identification of structural
homologues in a database of known structures sharing ≤ 25% sequence identity with each
other and from which the native fold has been excluded.

The results obtained from blind tests suggest that current potentials are quite similar in their
ability to recognise the native conformation of a target protein [102]; thus, an excessive
complication of the potentials does not appear to be justified, especially since, as the
structure database grows, the speed of the algorithm becomes a relevant issue.
Although recent work demonstrates the theoretical basis of the Boltzmann formulation
[104], these potentials have been criticised because they would not represent physically
realistic force fields: for example, the potential for equal charges is similar to that derived
for opposite charges, probably reflecting the tendency of charged residues to lie on the
surface rather than a specific interaction between them [105]. However, as the aim of
knowledge-based potentials is to predict protein structures rather than to represent the
'true' physical forces, essentially any potential which works can be considered a useful tool
for protein fold recognition [91].

Once an energy function has been defined, the sequence is threaded into a library of
structures; the 'threading' consists in 'inscribing' the target sequence into each structure so
that each residue of the sequence replaces one residue in the structure. Alignments
which provide a low value of the energy function should correspond to folds
compatible with the target sequence.
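
The 'inscribing' of a sequence into a structure can be sketched in its simplest, gap-free form: the sequence is slid along the structure and, for each placement, the energies of the residue pairs that fall on contacting positions are summed. The contact list, the two-valued pair energy and the example sequence are hypothetical; in practice the pair energies would come from a knowledge-based potential.

```python
HYDROPHOBIC = set("LIVMFWC")  # assumed hydrophobic set, illustration only


def pair_energy(a, b):
    """Hypothetical contact energy: reward hydrophobic-hydrophobic contacts."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def thread_ungapped(sequence, n_positions, contacts):
    """Score every gap-free placement of the sequence on the structure.

    n_positions: number of residue positions in the structure.
    contacts:    list of (i, j) position pairs that are in contact.
    Returns a list of (offset, energy), one entry per placement.
    """
    length = len(sequence)
    results = []
    for offset in range(n_positions - length + 1):
        energy = 0.0
        for i, j in contacts:
            if offset <= i < offset + length and offset <= j < offset + length:
                energy += pair_energy(sequence[i - offset], sequence[j - offset])
        results.append((offset, energy))
    return results


# Invented structure with 5 positions and two contacts.
placements = thread_ungapped("LKL", n_positions=5, contacts=[(0, 2), (1, 4)])
best_offset, best_energy = min(placements, key=lambda p: p[1])
```

Only the placement that puts the two leucines on contacting positions obtains a favourable energy, so it is selected as the best frame.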

Different solutions have been proposed for the choice of the library of folds, the way to treat 
insertions/deletions in the sequence-to-structure alignment, the way to substitute residues
from the target sequence into the structures and the algorithm used to align sequences with 
structures. 

Libraries of folds are constructed by selecting non-redundant entries representative of all 
known folds from the protein structure database. In some cases, just the secondary 
structure elements in the protein core regions rather than the overall structures are used, 
based on the rationale that this is the only part of the structure which is conserved among 
distantly related proteins ([92] and references therein). However, as it has been shown that
only part of the 'key-residues' responsible for correct fold recognition are found within 
secondary structure elements [92], this choice could prevent fold recognition. 
In some cases, idealised folds rather than real ones have been used [92]. They have been
obtained, for example, by modifying the topologies of native protein core regions; therefore,
even if fold recognition techniques are used, the prediction can be considered as ab initio.
However, the results obtained for the modified topologies are worse than those obtained for
experimental ones, suggesting that the modified topologies lack some crucial features which
are necessary to recognise native folds (see URL:
http://PredictionCenter.llnl.gov/).

Even among structurally related proteins, insertions and deletions ('gaps') are likely to 
occur; the way in which gaps are treated varies notably among the different threading 
approaches. 

The insertion of gaps increases the computational complexity, so many threading
approaches do not allow gaps at all in the sequence-to-structure alignment ('no-gap' or
'ungapped' threading): these approaches mount the sequence on a portion of structure of
equal length [92]. 'No-gap' threading is not generally used for prediction purposes, in
that if insertions and deletions are ignored it will be difficult to find a good alignment of the
target sequence with a similar structure [99]; still, as the native fold is usually recognised in
spite of bad alignments [106], this method is generally used to test the potentials [92].
For prediction purposes, gaps of variable length are often allowed both in the target
sequence and in the structures used ([92] and references therein); in these cases, variations of
loop length and conformation among structurally related proteins would not prevent
recognising a similar fold for the target structure. As is the case for sequence alignments, the
choice of the penalty associated with insertions and deletions is quite relevant [100].
To reduce computational complexity, some approaches substitute the amino acids of the
target sequence for the amino acids in the structures one at a time, leaving the rest of the
structure unmodified; in this way, each residue of the target sequence is surrounded by the
residues of the structure onto which it is mounted, rather than by the corresponding
residues of the target sequence. This approach, called the 'frozen approximation', is quite
rough; still, in 'blind tests' it performed as well as more sophisticated methods [91]. The
reason probably lies in the fact that the 'frozen approximation' may be appropriate for the
recognition of similar folds, provided that conservative substitutions (e.g., replacement of
residues with similar chemical-physical properties) have occurred between the native fold
of the target sequence and a known structure with a similar fold; if this is the case, even if
the environment of the similar structure is not the same as that of the native one, it will be
sufficiently close to allow recognition [91].

For each sequence-to-structure alignment generated by the threading algorithms, the value
of the energy function is calculated and used to evaluate the likelihood that the sequence
can assume a fold similar to that of each structure in the database [106].
A useful measure of the goodness of the sequence/structure alignment is the z-score,
usually calculated by most of the available programs, defined as:
z = (E - Em)/σ

where:

E = energy of the given alignment

Em = average energy over all alignments
σ = standard deviation
Large negative values of the z-score for the alignment of a sequence with a particular
structure indicate that the sequence is likely to be compatible with the structure.
It is useful to calculate the total interaction energy of each residue along the amino acid
sequence: native structures generally have energy values below zero in most sequence
positions with only few weak positive peaks, and a sequence correctly aligned to a similar
fold does not show many positive values [101].
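
The z-score defined above can be computed directly from the list of alignment energies. The sketch below uses the population standard deviation, which is an assumption since the text does not specify which estimator is meant, and the energy values are invented for illustration.

```python
import statistics


def z_scores(energies):
    """Return z = (E - Em) / sigma for every alignment energy in the list."""
    mean = statistics.fmean(energies)
    sigma = statistics.pstdev(energies)  # population standard deviation (assumed)
    if sigma == 0.0:
        raise ValueError("all energies are identical; z-score is undefined")
    return [(e - mean) / sigma for e in energies]


# Invented energies for five alignments; the first one stands out as low.
energies = [-10.0, 0.0, 0.0, 0.0, 0.0]
scores = z_scores(energies)
best_index = min(range(len(scores)), key=lambda k: scores[k])
```

The alignment with the most negative z-score is the one whose energy lies furthest below the average of all alignments.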

There are factors that could affect the value of the potential energy function and therefore 
the results of fold recognition experiments. For example, structures with similar sequence 
length and/or amino acid composition to the target sequence could be erroneously scored 
as similar; moreover, the higher the number of possible alignments (because of the length
of the alignment or of the higher number of gaps allowed), the higher the probability that a
good alignment can be found by chance. Scoring schemes have been proposed in order to
correct for these possible artefacts, and this has indeed been shown to reduce the number of 
false positives [100]. 



Fold recognition can also be achieved by comparing sequence-based predictions obtained 
through 1D and 2D methods, for example predictions of secondary structure, solvent
accessibility and long-range contacts (see below), with analogous information extracted 
from known protein structures (see for example [107-109]). 

The development of these methods, called mapping, has been catalysed by the improved 
accuracy of secondary structure prediction methods which are now accurate enough to 
serve as the basis for tertiary structure predictions [110]. The accuracy of mapping methods 
is therefore expected to increase with improvements in ab initio prediction methods for 
secondary structure and solvent accessibility. 

Some methods [109] compare the secondary structure assignments (α-helices and β-strands)
predicted for the target sequence from multiple sequence alignments with the secondary
structures extracted from a library of protein domains (allowing for insertions and deletions
of whole secondary structure elements) to find all possible domains whose secondary
structure matches that of the target sequence. A series of 'filters' based on simple rules
about protein structures are then applied to these matches (or 'maps') to restrict the number
of plausible folds. Among the filters used there are, for example, the observed and expected
values for the radius of gyration, the distance between co-ordinates that have to be bridged
by loops of a certain length, the β-sheet topologies (e.g., folds with isolated β-strands are
removed) and distance restraints from experimental data (e.g., NMR measurements,
presence of disulphide bridges or of clusters of functional residues). The patterns of
predicted and experimental solvent accessibility are used to align the sequence of the target
and that of the remaining folds and the final alignments are evaluated on the basis of
accessibility and secondary structure matching. This procedure is able to reduce the number
of possible folds for a target protein to a few plausible alternatives, and ideally to just one
match [109]. The accuracy of this method is reported to be comparable to or better than that of
the more computationally intensive threading methods in recognising native-like folds and in
correctly aligning amino acid residues and secondary structure elements [109]. As mapping
methods heavily rely on the accuracy of secondary structure predictions and these, in turn,
have been shown to be much more reliable when based on multiple sequence alignments
rather than on a single sequence, it is believed that the essential pre-requisite for successful
fold recognition through mapping methods is to start from a high quality multiple sequence
alignment containing a sufficient number of adequately diverse sequences [107].
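
One of the filters mentioned above, the comparison of observed and expected radius of gyration, can be sketched as follows. The unweighted definition (one point per residue, no atomic masses), the fractional tolerance and the example co-ordinates are assumptions made for illustration; how the expected value is derived in the cited method is not stated here and is simply passed in as a parameter.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                  for x, y, z in coords)
    return math.sqrt(squared / n)


def passes_rg_filter(coords, expected_rg, tolerance=0.2):
    """Keep a fold if observed and expected values agree within a fraction."""
    observed = radius_of_gyration(coords)
    return abs(observed - expected_rg) <= tolerance * expected_rg


# Invented co-ordinates: four points at unit distance from the origin.
square = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
kept = passes_rg_filter(square, expected_rg=1.1)
rejected = passes_rg_filter(square, expected_rg=2.0)
```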

II. Model building 

When fold recognition procedures identify a significant match between the target sequence 
and a known structure, this structure can be used as a template to build a model of the 
target protein. Model building follows the same steps described for homology modelling: the
main-chain of the secondary structure elements can be modelled on that of the template
protein, while for side-chain and loop prediction alternative strategies have to be used.
Once again, the quality of the alignment will be the main parameter in determining the 
quality of the final model. 

III. Accuracy 

The assessment of the performance of fold recognition methods is not a straightforward 
task in itself, in that methods to compare structures of unrelated proteins [90] and criteria 
to decide if such structures are similar or not have to be developed [111]: in some cases 
predicted and target structures are almost identical while in other cases the similarity is
borderline and often several possible alignments can be obtained [90]. 



The criteria used in the first protein structure prediction assessment experiment were not
very stringent (and were in fact modified in CASP2):

1. The fold was considered as correct if a significant fraction of the secondary structure
elements of the selected fold could be aligned with the target structure with an r.m.s.d. ~
3.0 Å. Moreover, both the best (i.e., the lowest energy) hit and the first 10 best hits were
taken into account; the reason to consider the first 10 hits is that generally they all have
very similar scores and even correctly recognised folds do not have a score significantly
higher than incorrect folds [90].
2. The secondary structure segments were considered as correctly aligned if at least one
residue of a predicted element overlapped with the corresponding secondary structure
element in the target; other indicators of the goodness of the alignment were the number
of residues by which secondary segments were shifted along the sequence, the
average shift over the whole sequence and the number of correctly aligned secondary
structure segments versus the number of theoretically alignable elements [90].
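
The second criterion above, overlap of at least one residue between a predicted element and the corresponding element in the target, can be sketched as follows. The representation of segments as inclusive (start, end) residue ranges, the pairing of elements by list index and the example ranges are assumptions introduced for illustration.

```python
def overlaps(predicted, target):
    """True if two inclusive residue ranges share at least one residue."""
    return predicted[0] <= target[1] and target[0] <= predicted[1]


def count_aligned(predicted_segments, target_segments):
    """Count predicted elements overlapping their corresponding target element.

    Elements are paired by position in the two lists (an assumption).
    """
    return sum(1 for p, t in zip(predicted_segments, target_segments)
               if overlaps(p, t))


# Invented example with three secondary structure elements.
predicted = [(1, 10), (20, 30), (40, 45)]
target = [(5, 12), (31, 38), (45, 50)]
n_aligned = count_aligned(predicted, target)
```

Here the second element misses its counterpart by one residue, while the third is counted as aligned although it shares only a single residue, which shows how permissive the criterion is.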
In [...], the success of fold recognition methods in rigorous 'blind tests' was partial;
however, fold recognition is the method for protein structure prediction that has shown the
biggest improvement in the second experiment [92].

In both experiments each of the methods proved capable of identifying some of the folds in the
absence of detectable sequence homology between the target and a protein of similar
[...], even in cases where the similarity between the target structure and the known
folds was rather low [102, 91]. However, although all the targets were recognised by at
least one method [111], no method was able to recognise all the targets, even if the number
of targets identified by each group increased in the second competition [92]. Moreover, clear
methods to assess the correctness of a result a priori are still lacking [90].
Folds which are more represented in the protein structure database are identified more
easily (and usually the prediction is correct), probably because the potentials derived from the
[...] are biased in favour of these structures, and also because the libraries
used for fold recognition often contain just a copy of less common [...]
of the more frequent folds; therefore there is a higher probability to identify the latter by [...]



The quality of the alignments of the target sequence to correctly [...] folds [...]
by threading methods, is correct in just a small number of cases and it is often distant from
the optimal alignment: it is generally more difficult to obtain a correct alignment than to
recognise a correct fold [102, 91].
In the first competition significant local shifts in secondary segments were [...]
[102, 91]. This is quite a serious limitation in that the lack of an accurate alignment prevents
the construction of a useful 3D model for the target protein. [...]
predictions submitted for the core regions were not good enough [...]
modelling of the loops [90]. In the second meeting, accurate alignments were
provided only for targets that, despite the lack of significant [...]
known proteins, could be easily recognised as homologous to some known fold based
on the similarity of their function and on the presence of [...]
Nevertheless, alignments provided by fold recognition methods were considerably better
than those obtained with sequence alignment methods, which means that these
methods could be used to align protein sequences with very low homology [91].



The two CASP experiments demonstrated that the prediction results could be positively 
affected by human intervention: manual adjustments of sequence alignments, visual 
inspection of the selected fold, comparison of the secondary structure of the selected fold 
with the secondary structure predicted for the target and consideration of common 
functions between the target and the fold [107, 90, 91]. As an example, two of the 
participants in CASP2 were able to identify the correct fold for some targets based on just
the predicted secondary structure of the targets and their deep knowledge of protein
structures and their relationships with the function [91]. However, manual intervention is
not always successful; in some cases, correct automated predictions have been discarded in 
favour of worse alternatives [90]. 

Finally there is a distinction between 'strong' and 'weak' fold recognition [102]: strong fold 
recognition attempts to find the known fold which is structurally most similar to that of the 
target protein while weak fold recognition attempts to identify a small set of folds which are 
compatible with the target sequence and that could be subsequently analysed, for example 
considering similarity in function or experimental constraints. Weak fold recognition is 
probably a more realistic goal to achieve and is potentially able to provide very good results 
when combined with other information [102].
AB INITIO METHODS 

Unlike homology modelling and fold recognition methods, ab initio methods for protein 
structure prediction do not use proteins of known structure as templates. However, these
methods also use information contained in known protein structures, although less
explicitly, to understand general principles governing protein architecture, to derive
forcefield parameters, or as input for neural network systems. Some methods also join
knowledge about protein structures to a priori chemical-physical principles. Ab initio
methods could be used to predict either whole protein structures or parts of them, although
the former is still beyond the capabilities of the existing algorithms.
I. CLASSIFICATION OF AB INITIO METHODS 

Two strategies are usually applied to predict the tertiary structure of proteins ab initio. The
first (primary -> secondary -> tertiary) consists of two steps: the prediction of the
secondary structure from the amino acid sequence (primary -> secondary) and the assembly
of the secondary structure elements into a 3D structure (secondary -> tertiary). The second
consists in the prediction of the tertiary structure directly from the sequence (primary ->
tertiary) [112].

A great variety of methods are usually ascribed to the ab initio category, and they can be
classified, according to the type of information that they provide, as [91]:
• 0D methods: predict which fold class a protein is most likely to belong to (all alpha-helix,
  all beta-sheet, alpha/beta or alpha+beta) [113-115]
• 1D methods: predict secondary structure elements (alpha-helix, beta-strand, loop) and
  residue accessibility
• 2D methods: provide predictions of long-range contacts (within whole elements of
  secondary structure or single residues)
• 3D methods: provide predictions of tertiary structures (overall fold or shape of the
  protein or 3D co-ordinates)

Alternatively, ab initio methods can be classified on the basis of the principles they rely on
[23] as:

• Methods based on conformational energy calculations, e.g., the search for the most stable
  protein conformation and protein folding simulations.

• Methods based on the variability in families of aligned sequences, e.g., secondary
  structure prediction, prediction of long-range contacts, prediction of functional residues.
  These methods are based on the observation that significantly more information is
  contained in the evolutionary history of a protein than in a single sequence. The starting
  point for all these methods is therefore a good multiple alignment of the target sequence
  with sequences of homologous proteins.

II. SECONDARY STRUCTURE AND SOLVENT ACCESSIBILITY PREDICTIONS 
II.a. Methods 

As most of the residues in protein structures are part of regular secondary structure 
elements (alpha-helices, beta-strands, reverse turns), a lot of effort has been devoted to 
obtain accurate predictions of these segments [60]. This would be an important step toward
complete 3D structure prediction, provided that a reliable method to assemble
these elements correctly in space (for example, based on the prediction of long-range
interactions) became available.

Several methods of secondary structure prediction are based on statistical information 
derived from the analysis of known protein structures and sequences. 
Amino acid residues show conformational preferences for secondary structure elements 
and for specific positions within these elements (alpha-helix, beta-strand, reverse turn) [58]; 
even though these preferences are not very strong, the clustering in the sequence of several 
residues preferring one type of secondary structure suggests the presence of that secondary 
structure [60]. 

The periodicity of hydrophobic and hydrophilic residues can be typical of specific
secondary structures: alternating hydrophobic and hydrophilic side-chains are likely to be
part of a strand in a beta-sheet with a hydrophilic face exposed to solvent; alpha-helices
should have a hydrophobic residue every three or four residues to allow a hydrophobic
face to pack against the rest of the protein [116].
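
The periodicity argument can be illustrated with a small numerical sketch. It is not the method of [116]: it uses a hydrophobic-moment style sum, and both the binary hydrophobic/hydrophilic assignment and the two rotation angles (180 degrees per residue for a strand-like alternation, 100 degrees per residue for an alpha-helix with 3.6 residues per turn) are assumptions made for illustration only.

```python
import cmath
import math

# Hypothetical binary classification of residues (an assumption, not taken from the text).
HYDROPHOBIC = set("AVLIMFWC")


def periodicity_score(sequence, angle_degrees):
    """Magnitude of the hydrophobicity signal at a given rotation angle per residue.

    Each residue contributes +1 (hydrophobic) or -1 (hydrophilic), rotated by
    n * angle; a large magnitude means the pattern repeats with that period.
    """
    delta = math.radians(angle_degrees)
    total = 0j
    for n, residue in enumerate(sequence):
        h = 1.0 if residue in HYDROPHOBIC else -1.0
        total += h * cmath.exp(1j * n * delta)
    return abs(total)


if __name__ == "__main__":
    examples = {
        "alternating (strand-like)": "VKVKVKVK",
        "every 3-4 residues (helix-like)": "LKKLLKKLLKKL",
    }
    for name, seq in examples.items():
        strand = periodicity_score(seq, 180.0)
        helix = periodicity_score(seq, 100.0)
        print(f"{name}: score at 180 deg = {strand:.2f}, score at 100 deg = {helix:.2f}")
```

Comparing the two scores for a sequence window gives a rough indication of which periodicity dominates; real predictors use graded hydrophobicity scales rather than the two-valued one assumed here.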

Among the methods for the prediction of regular elements of secondary structure
(alpha-helices and beta-strands) described in the literature [117, 118], those exploiting the
evolutionary information contained in a multiple sequence alignment provide the best
results [112, 118]. Secondary structure elements are usually conserved in homologous
protein structures and therefore a consensus prediction obtained for all sequences of the
alignment is likely to be more accurate and reliable than a prediction performed on a single
sequence [118]. The PHD program, which uses multiple sequence alignments as input to a
neural network system, provides particularly good results: the estimated accuracy is slightly
higher than 70% [118]. Other algorithms based on multiple sequence alignments give results
almost as good [119-121].
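
The idea of a consensus prediction over an alignment can be sketched as a column-wise majority vote. This is only an illustration of the principle; it is not how PHD works (PHD feeds the alignment to a neural network), and the state letters ('H' helix, 'E' strand, 'C' other, '-' gap) are a convention assumed here.

```python
from collections import Counter


def consensus_secondary_structure(predictions):
    """Column-wise majority vote over aligned per-sequence predictions.

    `predictions` is a list of equal-length strings over 'H', 'E', 'C',
    with '-' marking alignment gaps, which do not vote.
    """
    if not predictions:
        return ""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all aligned predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        votes = Counter(state for state in column if state != "-")
        # A column made only of gaps carries no information; report it as 'C'.
        consensus.append(votes.most_common(1)[0][0] if votes else "C")
    return "".join(consensus)


if __name__ == "__main__":
    aligned = ["HHHEEC", "HHHEEC", "HHCEE-"]
    print(consensus_secondary_structure(aligned))
```

An isolated disagreement in one sequence (the third position of the last prediction above) is outvoted by the other members of the family, which is the intuition behind the improved reliability.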

The prediction of solvent accessibility can be useful to predict the spatial orientation of 
secondary structure segments. Solvent accessibility has been described in terms of two 
(buried/exposed) [122-124], three (buried/intermediate/exposed) [125, 126] and ten states 
[127]. 

A system of neural networks, analogous to that used to predict secondary structure, has
been developed for the prediction of solvent accessibility [127]. This system gave better
results than other methods thanks to the additional information contained in multiple
alignments and to the use of a ten-state rather than a three-state model for relative
accessibility [127].

An intrinsic limitation of these methods is that solvent accessibility is much less conserved
than secondary structure in homologous proteins; the information that can be
extracted from multiple sequence alignments is therefore more limited.

II.b. Accuracy 

The accuracy of secondary structure prediction methods depends on several parameters, for 
example, the number and type of sequences in the multiple alignment: when the sequences 
are few or the degree of diversity between them is small, the quality of the prediction is 
lower [112]. The multiple sequence alignment itself can be modified on the basis of 'expert'
knowledge; human intervention and the use of statistical methods [128] have been reported
to improve the prediction accuracy in a number of cases [129, 112, 107, 109].
Blind tests of the PHD program have confirmed the claimed accuracy (Q3 = 72 ± 9%; see
footnote 2) [118] and have shown that the predicted reliability correlates with the observed
accuracy. Secondary structures not present in all the families of the alignment and the ends
of secondary structure elements are often difficult to predict [112].
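
The Q3 measure quoted above is the percentage of residues correctly assigned to one of three states (helix, strand, other). A minimal sketch of the calculation follows; the one-letter state codes 'H', 'E' and 'C' are an assumed convention.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state.

    Both arguments are equal-length strings over three states, here written
    'H' (helix), 'E' (strand) and 'C' (other).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    print(q3_score("HHHHCCEEEE", "HHHCCCEEEC"))
```

In the example two residues out of ten are mispredicted, both at the ends of secondary structure elements, which is where the text notes that errors tend to occur.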

The level of accuracy reached by these methods is good enough for other methods (e.g., fold 
recognition methods) to use the predicted secondary structure elements, possibly together 
with other restraints [117], as a useful starting point to build a 3D model or as criteria to 
assess the reliability of the results. 

For methods relying on multiple sequence alignment, one limitation is that the prediction of
secondary structure elements is less accurate for those elements which are not common to
the whole family. On the other hand, elements which are in the core of the protein are those
less likely to diverge even in distantly related proteins [25]; this makes the identification of 
the correct tertiary fold possible even if secondary structure elements outside of the core are 
not predicted correctly [129]. 

Secondary structure predictions, similarly to tertiary structure predictions, seem to be less 
efficient for unusual folds [24]; possibly, in both cases, because of the bias present in the 
parameters derived from the current database. 

III. PREDICTION OF LONG-RANGE INTERACTIONS 

These methods aim at predicting the relative spatial position of predicted secondary
structure elements or residues, either by predicting long-range interactions or by using
combinatorial approaches and/or semi-empirical rules.

III.a. Methods

The term 'long-range interactions' is used to describe interactions among residues which 
are spatially close to each other in a 3D protein structure but far from each other in the 
protein sequence. The methods developed to predict long-range interactions exploit 
information contained in single sequences or, more often, in multiple sequence alignments, 
to give matrices of predicted inter-residue contacts in a protein (2D prediction), called 
'contact maps'. 
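
To make the notion of a contact map concrete, the sketch below derives one from known co-ordinates, which is the kind of reference map a 2D prediction would be compared with. The distance cutoff (8 Å between C-alpha atoms) and the minimum sequence separation (4 residues) are assumptions chosen for illustration; published methods use various definitions.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=4):
    """Symmetric boolean matrix of long-range contacts.

    Residues i and j are in contact when their C-alpha atoms lie within
    `cutoff` (in Angstrom) and they are at least `min_separation` positions
    apart in the sequence, so that trivial neighbours are excluded.
    """
    n = len(ca_coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


if __name__ == "__main__":
    # Idealised two-stranded hairpin: four residues out, four residues back.
    hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
               (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
    for row in contact_map(hairpin):
        print("".join("#" if c else "." for c in row))
```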

One of these methods [131] predicts long-range interactions between β-strand residues in
β-sheets and is based on the statistically derived frequencies of pair-wise inter-residue
contacts. Residue distribution on adjacent β-strands has been shown to deviate significantly
from randomness, so that pair-wise preferences could be extracted from known protein
structures [132]. In some cases, these preferences can be rationalised on the basis of
complementary chemical-physical properties between directly interacting residues (e.g.,
Ser/Thr and Val/Ile are favoured, Thr/Val and Lys-Arg/Leu are not). In other cases, it is
more difficult to rationalise the observed preferences, possibly because the interaction
between residues is mediated by solvent molecules at the protein surface or by the packing
environment in the protein interior. Specific inter-residue preferences depend upon the
β-sheet topology (parallel or anti-parallel), on the presence or absence of hydrogen bonds
between the backbone atoms of the two residues in contact and on the relative position of
the two residues with respect to the N-terminal and C-terminal ends of the protein [131].
Based on these preferences, it is possible to recognise the correct pairing of β-strands in a
β-sheet in known structures with an accuracy of 75% or better. While in principle a similar
analysis could be performed for helix-helix and helix-strand interactions, in these cases the
lack of strong hydrogen-bonding distance constraints could make the recognition of specific
residue-residue contacts more difficult.

2 Q3 indicates the percentage of residues correctly predicted to be in one of three states: helix, strand, other [130].

Other methods for long range interaction prediction [133-135] are based on the observation 
that residues in physical contact in the 3D structure in some cases show a correlated 
mutational behaviour, which can be recognised in a multiple sequence alignment [136, 
135]: sequence mutations that could interfere with the maintenance of structure or function 
within protein families, might be compensated by complementary mutations in nearby 
positions to allow for the protein (and cell) survival [136]. As an example, if a bulky side 
chain in the protein interior is substituted by a small one, other residues could mutate 
appropriately to fill the newly formed cavity. Consequently, it is possible that, if in a 
multiple sequence alignment two positions mutate in a correlated manner, the residues 
occupying those positions are in physical contact in the 3D structure. Pairs of residues that 
are correlated, in the sense described above, do have a weak tendency to have smaller 
distances in the 3D structure [133] and the method might therefore be useful to predict 
long-range inter-residue contacts which, in turn, can be of help in modelling the relative 
spatial orientation of secondary structure elements. One of the major shortcomings of these 
methods is that the compensatory response of a protein structure to a point mutation is not 
generally the mutation of another single residue but could involve a cluster of residues 
[133]; in some cases, the compensatory response to single point mutations is even achieved 
through small shifts of secondary structure elements [131]. To partially account for this, in 
some cases clusters of correlated residues have been considered [133]. 
As residues involved in protein function are generally conserved within protein families, 
even among distant relatives, the analysis of patterns of conserved and variable residues in 
multiple sequence alignments can be used to predict functional residues. When the 
conservation patterns are clear, functional residues can be recognised by visual inspection 
of multiple alignments, but more subtle patterns of conservation can only be caught 
through the use of specific tools [137]. The relative spatial arrangement of functional 
residues can be useful to orient secondary structure elements and to decide about the 
likelihood of a predicted structure. 
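
The correlated-mutation idea discussed above can be sketched with a simple covariation score between alignment columns. The score used here is mutual information, which is a swapped-in, commonly used measure and not necessarily the statistic of the cited methods [133-135]; the threshold and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns of equal length."""
    n = len(column_a)
    freq_a = Counter(column_a)
    freq_b = Counter(column_b)
    freq_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


def covarying_pairs(alignment, threshold):
    """Return (i, j, score) for every pair of columns scoring at or above the threshold."""
    columns = list(zip(*alignment))
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            score = mutual_information(columns[i], columns[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 change together, columns 2 and 3 are conserved.
    toy_alignment = ["ALGK", "ALGK", "VIGK", "VIGK"]
    print(covarying_pairs(toy_alignment, threshold=0.5))
```

Fully conserved columns score zero, so conserved (possibly functional) positions and co-varying (possibly contacting) positions are separated by this kind of analysis. With so few sequences the score is dominated by noise, which is consistent with the need for large alignments noted in the text.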

The prediction of long-range interactions can also be used to select the correct fold among
the candidates generated by fold recognition methods [131] (R. Leplae, T. Hubbard and A.
Tramontano, submitted for publication).



III.b. Accuracy 

Methods predicting the spatial arrangement of secondary structure elements have been
successful in 'blind' tests in recognising folding motifs similar to those of already known
structures (e.g., leucine zippers) [129, 107, 135]; on the other hand, unusual folding motifs
are still difficult to predict [24].

The accuracy of the prediction of β-strand residue-residue contacts is low when contact
maps are generated from a single protein sequence, but it can be considerably increased by
using multiple sequence alignments and, in some cases, by knowledge-based considerations
(e.g., an incorrect prediction of a parallel sheet can be easily recognised if the β-strands are
joined by fewer than 10 residues: only anti-parallel strands are connected by segments of
that length).
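
The knowledge-based rule in the example above can be written as a one-line filter. The function name and the residue-numbering convention (inclusive positions of the last residue of the first strand and the first residue of the second) are hypothetical; the 10-residue limit is the one quoted in the text.

```python
def parallel_pairing_is_plausible(end_of_first, start_of_second, min_linker=10):
    """Return False when the segment joining two strands is too short for parallel pairing.

    `end_of_first` is the position of the last residue of the first strand and
    `start_of_second` the position of the first residue of the second strand.
    """
    linker_length = start_of_second - end_of_first - 1
    return linker_length >= min_linker


if __name__ == "__main__":
    # Strand 1 ends at residue 20; strand 2 starts at residue 25 (4-residue linker).
    print(parallel_pairing_is_plausible(20, 25))
    # Same first strand, second strand starting at residue 40 (19-residue linker).
    print(parallel_pairing_is_plausible(20, 40))
```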

The accuracy of the prediction of long-range contacts based on correlated mutations has
been reported to be up to five-fold better than random [133], and to increase when other
information is used. The evolutionary distance between sequences showing simultaneous
variations, the specific type of co-variation observed (e.g., volume, hydrogen bonding,
charge) and the tertiary structural context (interior or surface) of the co-varying residues
can all be effectively taken into account [135]. Restricting the analysis to the residues which
are expected to be in the protein interior can also improve the results [135].

Some authors [91] believe that an alignment of a very high number of sequences is required
to even attempt a prediction of long-range contacts, and that even if the prediction of such
contacts is possible, it is of limited usefulness because of the high number of false positives.
Further improvements could be obtained if tools to distinguish between correlated
mutations and mutations which do not need to be compensated ('neutral' mutations) were
developed.

IV. ENERGY BASED METHODS 

These methods try to predict tertiary structure from the amino acid sequence alone, that is, to
solve the classical 'folding problem': given a protein sequence and a model of the
interactions between residues, they try to recognise the protein conformation corresponding
to its native structure [138]. Although they pursue a very difficult goal, these approaches
have the advantage of not depending on the existence of a fold similar to that of the target
protein in the protein database and are therefore potentially able to predict completely
new folds. Of course, if the number of protein folds is really limited, as most of the protein
folds become known, the need for ab initio methods will decrease [112].

These methods consist essentially of two steps: the generation of multiple possible
conformations for the target protein and the energetic evaluation of these conformations.
Some energy based methods are based on the assumption that the conformation with the
lowest energy corresponds to the native state (that is, the folding process is under
thermodynamic control). These methods try to generate as many conformations as possible
of a protein structure or of a region and to evaluate them on the basis of the energy
calculated for each conformation.

Other methods are based on the assumption that a polypeptide chain reaches the native 
conformation through an energetically accessible pathway, without having to search the 
complete conformational space. They try to simulate the folding process by dividing it into 
several steps; at each step they generate and evaluate different conformations, selecting the 
conformation(s) with the lowest energy as the starting point for the next step [139]. 



Several methods are available to explore the conformational space of a molecule by varying
either its Cartesian or internal co-ordinates (i.e., dihedral angles): systematic approaches,
molecular dynamics, distance geometry and genetic algorithms.

Systematic methods evenly explore the conformational space of a molecule by varying its
rotatable dihedral angles with a pre-established increment, so that all the possible
combinations of the selected dihedral angle values are generated. While in principle they
can guarantee a nearly complete sampling of the conformational space available to a
molecule, their efficiency is limited by the number of dihedrals in the molecule and by the
value of the increment for each dihedral. For this reason, systematic searches are usually
applied to small protein regions (e.g., loops).
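
The combinatorial cost of a systematic search can be made explicit with a short sketch. The function names are hypothetical, and the increment is assumed to divide 360 degrees exactly so that the grid is even.

```python
import itertools


def count_conformations(n_dihedrals, increment_degrees):
    """Number of grid points visited by a systematic search."""
    if 360 % increment_degrees != 0:
        raise ValueError("the increment is assumed to divide 360 exactly")
    return (360 // increment_degrees) ** n_dihedrals


def systematic_conformations(n_dihedrals, increment_degrees):
    """Lazily generate every combination of dihedral values on the grid."""
    if 360 % increment_degrees != 0:
        raise ValueError("the increment is assumed to divide 360 exactly")
    values = range(0, 360, increment_degrees)
    return itertools.product(values, repeat=n_dihedrals)


if __name__ == "__main__":
    for n in (2, 6, 10, 20):
        print(n, "dihedrals at 30 degree steps:", count_conformations(n, 30))
    # Enumerating the grid itself is only practical for a handful of dihedrals.
    print(list(systematic_conformations(2, 120)))
```

The count grows as a power of the number of dihedrals, which is why such searches are restricted to short segments such as loops.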

Monte Carlo methods search the conformational space of a molecule through a random or
pseudo-random variation of either its Cartesian co-ordinates or, more often, its dihedral
angles: in this case, both the increment value and the number of rotatable dihedrals varied
at each step can be chosen randomly. At any given point of a random search the probability
of finding new conformers is proportional to the number of conformers not yet discovered;
this probability therefore decreases as the search progresses. Consequently, random
methods can also adequately cover the conformational space if run for a sufficiently long time.
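
A random search over dihedral angles of the kind just described can be sketched as follows. The energy function is a hypothetical toy (one cosine well per dihedral, minimum at 60 degrees), and the sketch simply keeps the best conformation seen; it does not apply a Metropolis acceptance criterion or any other refinement used in production Monte Carlo programs.

```python
import math
import random


def toy_energy(dihedrals):
    """Hypothetical energy: each dihedral has a single minimum at 60 degrees."""
    return sum(1.0 - math.cos(math.radians(angle - 60.0)) for angle in dihedrals)


def random_search(n_dihedrals, n_steps, seed=0):
    """Random walk in dihedral space, returning the lowest-energy conformation seen."""
    rng = random.Random(seed)
    current = [rng.uniform(-180.0, 180.0) for _ in range(n_dihedrals)]
    best, best_energy = list(current), toy_energy(current)
    for _ in range(n_steps):
        # Both the number of dihedrals varied and the increment are chosen at random.
        n_varied = rng.randint(1, n_dihedrals)
        for index in rng.sample(range(n_dihedrals), n_varied):
            increment = rng.uniform(-180.0, 180.0)
            # Wrap the new value back into the interval [-180, 180).
            current[index] = ((current[index] + increment + 180.0) % 360.0) - 180.0
        energy = toy_energy(current)
        if energy < best_energy:
            best, best_energy = list(current), energy
    return best, best_energy


if __name__ == "__main__":
    for steps in (10, 100, 10000):
        _, energy = random_search(n_dihedrals=3, n_steps=steps, seed=1)
        print(steps, "steps: best energy found =", round(energy, 4))
```

Because the random number generator is seeded, longer runs extend shorter ones, so the best energy can only improve or stay the same as the number of steps increases.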

In molecular dynamics, proteins, possibly together with explicit solvent molecules, are
treated on the basis of the principles of classical Newtonian mechanics. As this method is
considered to reliably reproduce the motion of a polypeptide chain as a function of time,
starting from a random structure and generating a long enough trajectory, the native
conformation should be found. However, current computational power is only able to
generate molecular dynamics trajectories for time periods considerably shorter (about
10^[…] s) than in vitro folding (typically about 1 s), so the complete conformational space of a
protein molecule cannot be explored [60].

Distance geometry is a method to convert a set of distance constraints into a random set of 3D
co-ordinates consistent with the constraints: the conformational space of a molecule is
described by a matrix of distance constraints including the maximum allowed distance
(upper limit) and the minimum allowed distance (lower limit) between any pair of atoms;
all the randomly generated conformers lie within these upper and lower limits. This
approach samples the 3D space quickly and efficiently but it cannot guarantee that it has
been thoroughly searched.

Genetic algorithms are based on mechanisms of natural gene evolution, like mutations and
crossing-over: several searches (mutations) are run simultaneously and information is
exchanged between them (cross-overs), thus increasing the efficiency of the overall process.
These methods have been used both to attempt the prediction of whole proteins and to
build loop regions or to find the correct set of side-chain rotamers given the experimental
backbone conformation [139].

To speed up the simulations, often these search methods use simplified representations of 
the polypeptide chain (for example, the amino acid side chains can be represented by 
spheres and the main chain by only Cα atoms) [112]. Moreover, often the search is 
performed on lattice models rather than in the complete conformational space of the 
protein molecule. 

The only way to calculate the exact energy of a molecule is to use precise
quantum-mechanical calculations. As these calculations are computationally intractable for a
molecule as complex as a protein, approximate functions are generally used to calculate the
potential energy of the structures generated by the conformational search methods. The
form of the energy function can vary, but usually it is the sum of different energetic terms 
chosen on the basis of the forces that are expected to act on protein structures. As an 
example, the following energy function takes into account the contribution to the total 
energy of covalent bonds (stretching, bending and dihedral energy) and non-covalent 
interactions (van der Waals, hydrogen bond, electrostatic energy): 

E_{total} = E_{stretching} + E_{bending} + E_{torsion} + E_{van der Waals} + E_{electrostatic} +
E_{hydrogen-bonding}

The energetic contribution of each of these terms to the total energy is calculated as a 
function of the deviation of the observed values from a set of previously determined 'ideal' 
parameters. These parameters represent, for each atom type, the preferred equilibrium
positions (for example: length of the N-Cα bond, distance between hydrogen bond partners,
etc.). These
'ideal' parameters together with the energy function and a set of force constants which 
penalise the deviations from the 'ideal' parameters constitute the so-called 'forcefield' of a 
molecule. 
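
How an 'ideal' parameter and a force constant combine into an energetic penalty can be sketched with the simplest possible functional form. The quadratic (harmonic) form is an assumption made for illustration, since the text does not specify the form of each term, and all numerical values below are hypothetical rather than taken from any published forcefield.

```python
def harmonic_penalty(observed, ideal, force_constant):
    """Energy penalty growing with the squared deviation from the ideal value."""
    return force_constant * (observed - ideal) ** 2


def total_energy(terms):
    """Sum the penalties of all terms, each given as (observed, ideal, force_constant)."""
    return sum(harmonic_penalty(observed, ideal, k) for observed, ideal, k in terms)


if __name__ == "__main__":
    # Hypothetical values: one bond length (Angstrom) and one bond angle (degrees).
    terms = [
        (1.50, 1.46, 300.0),   # stretched bond
        (112.0, 110.0, 0.02),  # slightly opened angle
    ]
    for observed, ideal, k in terms:
        print(observed, ideal, harmonic_penalty(observed, ideal, k))
    print("total:", total_energy(terms))
```

Some forcefields include a factor of one half in the harmonic term; the choice only rescales the force constants and is left out here for brevity.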

Many different forms of this function have been developed. They can contain just a few or
only one term or many additional terms (which penalise deviations from planarity of
specific groups, strengthen the chirality of specific centres or couple the different energetic
terms [23]); in several tests, extremely simple functions have proved as effective as more
complex ones (see URL: http://PredictionCenter.llnl.gov/).

Forcefield parameters can be determined in one or more of the following ways: performing
quantum-mechanical calculations on simple model systems, deriving them from the
statistical analysis of known protein structures or measuring them experimentally. Methods
which only use quantum-mechanical calculations can be considered as ab initio in a strict
sense, in that they essentially rely upon a priori chemical-physical principles [112].
However, most of the available forcefields do contain empirical parameters extracted from
protein structures and are usually tested by evaluating their ability to reproduce known
protein structures (see URL: http://PredictionCenter.llnl.gov/). A World Wide
Web server has been designed to enable an objective evaluation of forcefields and to address
important questions concerning forcefield development and application (see URL:
http://iris4.carb.nist.gov/).

IV.a. Accuracy

The available ab initio methods cannot yet provide accurate 3D structure predictions. With
existing methods, only the structure of extremely small proteins, for which extensive
conformational searching is feasible, can be predicted; however, the predicted structures are
still more than 4.0 Å from the experimental structures [112]. The main reason why energy
based methods have not been very successful is that energy functions and forcefield
parameters are neither sufficiently exact nor sufficiently complete to evaluate correctly the
energy of the different conformations generated by the conformational search methods. This
statement is corroborated by the fact that the energy minimisation of an experimentally
determined protein structure gives a local minimum conformation with an r.m.s.d. value of
about 1.0 Å from the starting conformation; this can be considered a sort of 'resolution' of
the forcefield used [23]. Furthermore, because of the errors in energetic parameters, the
energy of the native fold is not significantly lower than that of some of the incorrect folds
[140].
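
The r.m.s.d. values quoted in this section measure the average displacement between equivalent atoms of two structures. A minimal sketch of the calculation is given below; it assumes that the two co-ordinate sets list equivalent atoms in the same order and have already been optimally superimposed, a step that is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed co-ordinate sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must contain the same number of atoms")
    if not coords_a:
        raise ValueError("cannot compute the r.m.s.d. of empty structures")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    # Hypothetical minimised structure: every atom displaced by 1 Angstrom along z.
    minimised = [(x, y, z + 1.0) for x, y, z in experimental]
    print(rmsd(experimental, minimised))
```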



It has been suggested that the accuracy of these methods could be improved by exploiting
information contained in a multiple alignment; this information could be used, for
example, to calculate contact potentials taking into account the variability in a family of
aligned sequences [112]. Moreover, as the calculations required by ab initio methods are
computationally intensive, in this field more than in others an increase in computer
power could allow significant progress [141-143, 75, 139].

Modelling by homology 

I. The background 
There is a clear relationship between sequence identity and structure 
similarity in proteins: the similarity of the backbone conformation in the 
core regions of two proteins increases with the sequence identity between 
them (Crippen 1977; Chothia and Lesk 1986; Hilbert, et al. 1993) and this 
relationship forms the basis for modelling by homology or comparative 
modelling. This is the most used and also the most effective method to 
obtain a structure prediction of a protein when its sequence is clearly 
related to that of a protein of known structure. 

It has been estimated that, if applied to all possible targets in the present 
sequence database, modelling by homology would allow the prediction 
of at least one order of magnitude more proteins than the protein 
structures experimentally determined so far (Sali 1995). 

I.a. Why build a model 

Although structural data on proteins from X-ray and NMR techniques are
being produced at an impressive pace and the rate of structure deposition
is continuously increasing (Fig. 5.1), the attention devoted to protein
structure prediction is not at all diminishing, and there are very good
reasons to pursue this elusive problem.

First, in many cases it can be essential to gain structural information on 
the protein under study as soon as possible during a project so that 
effective experiments can be planned (Sollazzo, et al. 1990; Savino, et al. 
1993; Orlandini, et al. 1994; Pizzi, et al. 1994; Amati, et al. 1995; 
Ammendola, et al. 1995; De Francesco, et al. 1996; Luo, et al. 1996; Jackson, 
et al. 1997; Starling, et al. 1997). Second, the information provided by a 
model can be instrumental for the experimental determination of the 
protein's structure (Turkenburg and Dodson 1996). Furthermore, 
sometimes a model can be effectively used to modify the properties of a 
given protein or to explain functional differences (Scarborough and Dunn 
1994; Failla, et al. 1996). 

Last but not least, the challenge of understanding the rules underlying 
protein folding is intellectually attractive and has been one of the most 
actively pursued research fields in structural molecular biology, since the 
early success of Pauling in predicting the structure of an α helix before
any structural determination of a macromolecule (Pauling, et al. 1951).
Although we have witnessed many achievements since, the problem is far
from being solved, and on many occasions bursts of enthusiasm about one
method or another have been followed by depressing disillusionment.
Nevertheless, many methods and servers can provide putative answers to
the problem of predicting a protein structure, particularly when homology
modelling can be used, and the aim of this chapter is to highlight the
problems underlying the various approaches, to allow the reader to
critically evaluate the reliability of the results.

There are three important points to keep in mind in any protein structure 
prediction experiment. Stating that a model is not an experimental 
structure is a truism, but it is important to remember that even when there 
are reasons to trust the results of a modelling experiment, the level of 
accuracy of any model is, in the majority of the cases, not comparable to 
that of an X-ray or an NMR structure, even if the pictures of the model are
equally colourful and as aesthetically pleasing as those of an experimental
structure!

Another important point to highlight is that not all parts of the model are 
equally reliable. As we will illustrate here, each of the steps of the 
modelling procedure will introduce errors and these errors are not 
equally distributed over the model. A scenario where a theoretician 
produces a model that is delivered to the scientist who will use it, is 
therefore far from being ideal. It is essential that any conclusion derived 
from a model is carefully checked against the expected reliability of the 
part(s) of the model involved and that the modeller and the end user of 
the model work in strict collaboration. 

Furthermore, the reliability of a model dictates its proper usage.
Especially in the case of homology modelling, it is possible to evaluate a
priori the reliability of the model, which mainly depends upon the
sequence similarity between the target and template proteins. It is
unreasonable to use almost any model for detailed energetic calculations, 
to predict the binding affinity of a ligand or to design an appropriate 
ligand. A model can, in most cases, be used to predict which mutations 
are likely to be accepted by the protein structure, but precise accessibility 
values of a side chain or packing details can only be derived from a 
model based on a sufficiently high sequence similarity. 
Finally, with the possible exception of models based on very high 
sequence identity, there is only one way to gain confidence in a model 
and that is to use it to predict features of the protein. A correct model is 
one that can be successfully used to modify the properties of the molecule 
in a predictable way and only well designed and carefully performed 
experiments can be used to judge the quality of a model. 

I.b. Modelling by hand and development of 'automatic'
modelling programs

The observations in the previous paragraph have implications for the 
usage of servers or programs that perform all the model building steps 
automatically (Mandal and Linthicum 1993; Sali and Blundell 1993; 
Srinivasan and Blundell 1993; May and Blundell 1994; Taylor 1994; 
Peitsch 1996; Peitsch, et al. 1996; Peitsch 1997). Most of these systems have 
an effective user interface; some have been carefully tested and their 
output reports the expected accuracy for the various parts of the model. In
the best cases, it is possible to retrieve the intermediate steps of the
model building procedure.

It is essential to use only servers and programs that make the results of
their evaluation available to the users and that have been tested in blind
trials, and to check the results of all the intermediate steps of the automatic
model building, with particular attention to the alignment.

II. The steps involved 
Model building is a multistep procedure and can be outlined as:
• identification of the protein(s) to be used as template
• identification of the regions that are structurally conserved between
  target and template
• construction of a sequence alignment in these regions
• model building of structurally conserved regions (using the
  coordinates of the backbone of the template)
• model building of structurally variable regions (where, for example,
  insertions and deletions are present)
• construction of the side chains of the model
• refinement of the model.

II.a. Selection of the parent structures 

Homology modelling relies on the observation that similarity in sequence
implies structural similarity, so that the co-ordinates of the backbone of
the residues of the template proteins can be used to build a model of the
corresponding ones in the target protein. The similarity of the backbone
conformation in the core regions of two proteins increases with the
sequence identity between them. The core region here is defined by
superimposing the backbone atoms of the secondary structure elements,
and extending these elements to include additional residues at their ends,
as long as the r.m.s.d. is lower than some threshold, usually 3.0 Å. The
percentage of residues in this region depends upon the structural
similarity between two proteins and, consequently, upon their sequence
similarity.

It has been shown that (Chothia and Lesk 1986):

For proteins with sequence identity ≥ 50%, the r.m.s.d. of the backbone
atoms of the core region is ≤ 1.0 Å and this region comprises about 90% of
their structure.

For proteins with sequence identity ≤ 20%, the core region may comprise
as little as 50% of their structure, with an r.m.s.d. of the backbone atoms in
this region > 1.8 Å; significant structural differences can occur outside the
core.

Proteins with sequence identity between 20% and 50% have an
intermediate degree of similarity between those described.

This implies that the best template for model building is the one sharing
the highest sequence similarity with the target protein.



When several proteins of known structure with roughly the same sequence
similarity to the target are available, it is advisable to select the template
according to appropriate criteria, for example resolution or completeness
of the structures or state of ligation. A typical case is for example that of
immunoglobulins, where there are so many structures and the level of 
sequence conservation of the core is so high that such elements can and 
should be taken into account (Tramontano 1995). 

However, since sequence similarity is not constant along the whole 
sequence, the availability of more than one template could also allow us, 
at least in principle, to use fragments of each to build the complete model 
by selecting, for each region, the protein sharing the highest 'local' 
sequence similarity (Sutcliffe, et al. 1987; Sali and Blundell 1993). It is not 
yet clear how much better such approaches perform, since they have been 
applied to a limited number of cases, but it seems that they can provide a 
somewhat better model when the sequence identity between target and 
template is low, while they can be ineffective in cases where sequence 
similarity between target and template is high (Martin, et al. 1997). 

II.b. Alignment and alignment reliability 

We have assumed that one can easily measure the sequence similarity 
between two sequences (the target and the template in this case) and there 
are a number of automatic programs that can perform the alignments and 
measure sequence identity (by simply counting the number of identical 
residues found in corresponding positions) or their sequence similarity 



(by summing the 'similarity' scores between each pair of aligned 
residues). Usually these algorithms are used in data base searches so that 
a measure of sequence identity or similarity sufficiently accurate to 
evaluate the appropriateness of a template is directly reported by these 
programs. 
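
The two measures just described can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: identity is computed here over the columns where both sequences have a residue (other denominators are also in use), and the score table passed to `similarity_score` is a hypothetical stand-in for a real substitution matrix.

```python
def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percentage of identical residues among the aligned (ungapped) columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def similarity_score(aln_a: str, aln_b: str, scores: dict,
                     gap: str = "-", default: int = 0) -> int:
    """Sum of the pairwise 'similarity' scores over the aligned columns.

    `scores` maps residue pairs such as ("R", "K") to a score; the lookup is
    symmetric, and pairs missing from the table score `default`.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # gap penalties are deliberately left out of this sketch
        total += scores.get((a, b), scores.get((b, a), default))
    return total


if __name__ == "__main__":
    print(percent_identity("LADGTRCTGRGSDW", "LVD-SKCRAKG-DW"))
    toy_scores = {("R", "K"): 2, ("K", "K"): 5}  # illustrative values only
    print(similarity_score("RK", "KK", toy_scores))
```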

Deciding which alignment should be used for model building however 
presents a different problem. 

As described in Chapter 4, the problem of finding the alignment of two 
strings of characters that maximises sequence identity or similarity can be 
formulated in precise mathematical terms and algorithms able to solve 
this problem have been known for a long time. However the alignment 
that maximises sequence identity or similarity is not necessarily that 
corresponding to the best structural superposition of the proteins, while 
in a model building experiment one needs to align the amino acids that 
are in equivalent structural positions in the two proteins. 
The importance of a correct initial alignment cannot be emphasised too 
much. Every step in the model building process will be based on the 
alignment and even minor errors at this stage will produce major errors in 
the final model: a shift of the alignment of one residue with respect to the 
correct one can produce very large errors in the final model (Martin, et al. 
1997). 

Most of the uncertainties and errors in the alignment are due to the 
presence of amino acid insertions and deletions between two homologous 



proteins and to the relatively arbitrary choice of the similarity score and 
this has been described in detail before. However it is important to 
remember that there is a conceptual difference between building an 
optimal sequence alignment and obtaining an alignment that represents a 
structural similarity. 

For example the alignment shown here, although correct in terms of 
maximising sequence identity, cannot represent a structural alignment: 

sequence 1 LADGTRCTGRGSDW 
sequence 2 LVD-SKCRAKG-DW 
             * *     * * 

The amino acids marked by stars cannot correspond to each other in the 
structure, since it is impossible that two residues are in the same position 
in space and yet they are separated by one amino acid in one protein and 
none in the other! This is a trivial example, but it stresses the concept that 
a sequence alignment is a one dimensional result that should be 
translated into three dimensions. 

A more correct way to write the sequence alignment shown above could 
be: 

sequence 1 LADGT--RCTGR--GSDW 
sequence 2 LV---DSKCRAKGD---W 

to show that there is no one to one correspondence between the residues 
contiguous to the insertions: it is important to always try and imagine the 
structural implications of a sequence alignment. 
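
The residues marked by stars in the first alignment can be found automatically: they are the aligned columns that sit immediately next to a gap. The sketch below is only an aid for drawing attention to such positions; the function name is hypothetical.

```python
def gap_flanking_columns(aln_a: str, aln_b: str, gap: str = "-") -> list:
    """Return the 0-based columns that are aligned residue to residue
    but lie directly beside a column containing a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    gapped = [a == gap or b == gap for a, b in zip(aln_a, aln_b)]
    flagged = []
    for col, is_gapped in enumerate(gapped):
        if is_gapped:
            continue
        left = col > 0 and gapped[col - 1]
        right = col + 1 < len(gapped) and gapped[col + 1]
        if left or right:
            flagged.append(col)
    return flagged


if __name__ == "__main__":
    print(gap_flanking_columns("LADGTRCTGRGSDW", "LVD-SKCRAKG-DW"))
```

For the example above the flagged columns are the third, fifth, eleventh and thirteenth, that is the residues whose pairing should not be read as a structural equivalence.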



The CASP2 experiment provides a more realistic example of what we just 
described (Samudrala and Moult 1997). 

One of the target proteins for comparative modelling was Endoglucanase 
I (t0028), sharing 47% identity with a known structure (1celA). 
Automatic alignment in one case provided the following alignment for 
the region spanning residues 49-70 of the target (Samudrala and Moult 
1997): 

TARGET: CTVNGGVNTTLCPDEATCGKNC 
        |        |||||  || ||| 
PARENT: CYDGNTWSSTLCPDNETCAKNC 

while the correct alignment turned out to be: 
TARGET: CTVNGGV NTTLCPDEATCGKNC 

I I II 

PARENT: CYDGNTWSSTLCP DNETCAK-NC 

In other words, these two protein regions cannot be structurally aligned 
(the main chain varies by more than 4.0 Å), although the sequence 
alignment produces a convincing result by maximising the number of 
identical or similar residues. 

In an homology modelling experiment, the structure of one protein is 
known and this information should not be ignored. Once a sequence 
alignment has been obtained, it is extremely important to view the 
alignment in the context of the structure of the template protein. 
Insertions and deletions are much more frequent at the protein surface, 
where they only determine local variations of the structure, than in the 



core of the protein or within secondary structure elements, since in these 
regions they are most likely to affect the protein's structure and/or 
function. Consequently, secondary structure elements of the template 
should not be interrupted by gaps in the alignment. Similarly, completely 
buried side chains in the template should not be aligned to charged 
amino acids in the target sequence. This implies that each alignment, 
however derived, should be carefully checked and corrected manually. 
This has been shown to improve the final result in many cases and 
provides the further advantage of highlighting regions of ambiguous 
alignment that can be built with a lower level of confidence. 
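
The two checks suggested above (no gaps inside secondary structure elements of the template, no charged residues aligned to buried template positions) can be sketched as a simple screening routine. The input encoding is an assumption made for this example: `template_ss` holds one character per template residue ("H" helix, "E" strand, anything else for coil) and `template_buried` one boolean per template residue. Treating only D, E, K and R as charged is also an assumption.

```python
CHARGED = set("DEKR")


def check_alignment(template_aln, target_aln, template_ss, template_buried,
                    gap="-"):
    """Return a list of (column, message) pairs for suspicious columns."""
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    n_res = sum(1 for c in template_aln if c != gap)
    if len(template_ss) != n_res or len(template_buried) != n_res:
        raise ValueError("annotations must match the ungapped template")
    problems = []
    t_idx = -1  # index of the last template residue seen so far
    for col, (t, q) in enumerate(zip(template_aln, target_aln)):
        if t != gap:
            t_idx += 1
            if q == gap and template_ss[t_idx] in ("H", "E"):
                problems.append((col, "deletion inside secondary structure"))
            elif q in CHARGED and template_buried[t_idx]:
                problems.append((col, "charged residue at buried position"))
        elif q != gap:
            # Insertion in the target: it is suspicious when it falls between
            # two template residues belonging to the same helix or strand.
            if (0 <= t_idx < n_res - 1
                    and template_ss[t_idx] in ("H", "E")
                    and template_ss[t_idx + 1] == template_ss[t_idx]):
                problems.append((col, "insertion inside secondary structure"))
    return problems


if __name__ == "__main__":
    print(check_alignment("ACDEFG", "AC-KFG", "-HHHH-",
                          [False, False, False, True, False, False]))
```

Such a routine only flags columns for manual inspection; it does not decide how the alignment should be corrected.
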
In summary, the sequence alignment should be built taking into account 
the sequences of the two proteins and the structure of the template. 
However very often a multiple alignment of the protein family is 
available, and this case is becoming ever more frequent, given the 
continuously increasing number of experimentally determined 
sequences. 

Such a multiple alignment contains information about conservation and 
variability at each position and can be of great help in defining the 
positions of insertions and deletions (Altschul 1989; Henneke 1989; 
Subbiah and Harrison 1989; Thompson, et al. 1994a; Thompson, et al. 
1994b). 
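
As an illustration of the information carried by a multiple alignment, the sketch below computes, for each column, the frequency of the most common residue and the fraction of gaps. These two numbers are a deliberately crude measure of conservation and variability, chosen for this example; they are not taken from the methods cited above.

```python
from collections import Counter


def column_conservation(msa, gap="-"):
    """Return one (conservation, gap_fraction) pair per alignment column.

    `msa` is a list of equally long gapped sequences. Conservation is the
    count of the most frequent residue divided by the number of sequences.
    """
    if not msa:
        return []
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all sequences must have the same aligned length")
    result = []
    for col in range(n_cols):
        column = [seq[col] for seq in msa]
        residues = [c for c in column if c != gap]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            conservation = top_count / len(column)
        else:
            conservation = 0.0
        gap_fraction = (len(column) - len(residues)) / len(column)
        result.append((conservation, gap_fraction))
    return result


if __name__ == "__main__":
    for values in column_conservation(["ACD", "A-D", "AGE"]):
        print(values)
```

Columns with high conservation and no gaps are the natural anchors of the alignment, while columns rich in gaps indicate where insertions and deletions are tolerated by the family.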

In a field as difficult and challenging as protein structure prediction, one 
cannot afford to discard any information however it is obtained. Another 



important aspect is that any experimental information on the protein(s), 
has to be taken into account. Besides the obvious importance of locating 
and aligning correctly the active site residues of enzymes, there is often 
plenty of data on site specific mutants, location of epitopes and so on. 
Because of the relevance of the alignment for all subsequent steps, this is 
the stage where all the available information should be collected and 
taken into account. The alignment should be consistent with all that is 
known about the protein family under study. 

II.c. Structurally diverse regions: can we recognise them? 

The analysis of a (multiple) sequence alignment can also be useful to 
highlight which regions of the target protein are more likely to be 
'structurally divergent', that is which regions do not structurally 
correspond to any of the template's regions. The local sequence similarity, 
the conservation pattern, the location of the region in the template 
structure, the absence of insertions and deletions, are all elements that can 
allow us to predict that a region is part of the structurally conserved core 
and can be built using the backbone of the template. 
If more than one structure of members of the template family are 
available, they can be used to derive the extent of the conserved core of 
the family. Structural superposition of the available proteins of known 
structures will highlight which regions tend to be conserved during 
evolution and which are subject to variation and refolding (Kanaoka, et al. 
1989; Taylor and Orengo 1989b; Zuker and Somorjai 1989; Orengo and 



Taylor 1990; Rippmann and Taylor 1991; Orengo, et al. 1992; Russell and 
Barton 1992; Gracy, et al. 1993; Grindley, et al. 1993; Luo, et al. 1993; 
Orengo, et al. 1993; Rufino and Blundell 1994; Taylor, et al. 1994; Holm 
and Sander 1995a; Gotoh 1996; Holm and Sander 1996b; Koch, et al. 1996; 
Koch and Lengauer 1997; Schmidt, et al. 1997). 

Even after the alignment has been built and the model building steps 
have started, it is still worth re-examining the alignment when problems 
are encountered. For example, if some regions appear difficult or 
impossible to model, because the resulting packing of side chains is 
either too tight or too loose or the number of residues in a loop is 
inconsistent with the distance between the neighbouring regions, the 
alignment should be checked again, using the new insights derived by the 
attempts to build the model. 
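
One of the consistency checks mentioned above, whether a loop has enough residues to bridge the gap between its neighbouring regions, can be sketched as follows. The check assumes that consecutive Cα atoms are about 3.8 Å apart, a standard value for a trans peptide bond that is not given in the text, and it only detects loops that are too short, not loops that are too long.

```python
import math

CA_CA_DISTANCE = 3.8  # angstroms between consecutive C-alpha atoms (assumed)


def loop_can_span(n_loop_residues, anchor_before, anchor_after):
    """Return True if a loop of n residues could connect the two anchors.

    The anchors are the C-alpha coordinates of the residue preceding and of
    the residue following the loop, so n residues provide n + 1 virtual bonds.
    """
    if n_loop_residues < 0:
        raise ValueError("the number of loop residues cannot be negative")
    distance = math.dist(anchor_before, anchor_after)
    return distance <= (n_loop_residues + 1) * CA_CA_DISTANCE


if __name__ == "__main__":
    print(loop_can_span(2, (0.0, 0.0, 0.0), (11.0, 0.0, 0.0)))
    print(loop_can_span(2, (0.0, 0.0, 0.0), (12.0, 0.0, 0.0)))
```

A negative answer for a loop implied by the alignment is a strong hint that the gap has been placed in the wrong position.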

II.d. The loops 

As discussed, structurally variable regions cannot be built using the 
template, and some other method has to be employed. In general, these 
regions correspond to loops, that is regions connecting secondary 
structure elements of a protein, and are usually located at its surface. The 
prediction of loop structures is a particularly difficult task since they are 
much less regular and much more variable than α-helices and β-sheets 
and insertions and deletions are most likely to occur in these regions. On 
the other hand, information about loop structures is often important in 
that, in many cases, loops have an important functional role: thanks to 



their surface location, loops are often involved in interactions with other 
proteins or in the catalytic mechanism of enzymes and, in some cases, 
they constitute the nucleation site for protein folding. 
The prediction of loops can be based on the sequence patterns in the loop 
itself, on data base searching methods or on ab initio calculations. 
Methods based on sequence patterns are generally effective for short 
loops (3 to 4 residues), especially those connecting adjacent strands of 
anti-parallel β sheets, and are based on an exhaustive classification of 
loops followed by a tabulation of the corresponding specific sequence 
patterns. 

A turn is generally defined as a loop that allows the polypeptide chain to 
change its direction by 180°. In his early work, Venkatachalam 
(Venkatachalam 1968; Venkatachalam and Ramachandran 1969) defined a 
turn as being characterised by the formation of a hydrogen bond between 
the main chain carbonyl oxygen of the first residue and the amide 
proton of the fourth, and identified three conformations (called I, II and III) 
according to the main chain dihedral angles of residues in the turn (Fig. 
5.2a). The mirror images of these turns (denoted I', II', III') are also possible 
but disfavoured. Beta turns connecting adjacent strands of an antiparallel 
β sheet are called β hairpins and have been carefully analysed by a 
number of authors (Chou and Fasman 1977; Ananthanarayanan, et al. 
1984; Hollosi, et al. 1985; Sibanda and Thornton 1985; Milner-White and 
Poet 1986; Wilmot and Thornton 1988; Sibanda, et al. 1989; Sibanda and 



Thornton 1991). Two-residue β hairpins are often found in type I' and II' 
conformations, rather than I and II as happens for most non-hairpin turns. 
These conformations are probably preferred for β hairpins because they 
allow the correct twisting of the two adjacent β strands. 
Gamma turns are characterised by the presence of a hydrogen bond 
between the carbonyl group of one residue and the amino group of the 
amino acid two residues ahead in the sequence and are further classified 
as classic and inverse, which are the mirror image of each other (Milner- 
White, et al. 1988; Milner-White 1990) (Fig. 5.2b). Inverse gamma turns are 
much more frequent than classic ones. 
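
The classification of β turns by main chain dihedral angles can be sketched as a lookup against idealised values. The angles in the table are the commonly quoted ideal (φ, ψ) values for the two central residues of each turn type; they are not listed in this chapter and are an assumption of the example, as is the uniform 30° tolerance.

```python
# Ideal (phi2, psi2, phi3, psi3) in degrees for the two central residues.
IDEAL_BETA_TURNS = {
    "I": (-60.0, -30.0, -90.0, 0.0),
    "II": (-60.0, 120.0, 80.0, 0.0),
    "III": (-60.0, -30.0, -60.0, -30.0),
    "I'": (60.0, 30.0, 90.0, 0.0),
    "II'": (60.0, -120.0, -80.0, 0.0),
    "III'": (60.0, 30.0, 60.0, 30.0),
}


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def classify_beta_turn(phi2, psi2, phi3, psi3, tolerance=30.0):
    """Return the closest turn type within the tolerance, or None."""
    observed = (phi2, psi2, phi3, psi3)
    best_name, best_deviation = None, None
    for name, ideal in IDEAL_BETA_TURNS.items():
        deviation = max(angle_difference(o, i) for o, i in zip(observed, ideal))
        if deviation <= tolerance and (best_deviation is None
                                       or deviation < best_deviation):
            best_name, best_deviation = name, deviation
    return best_name


if __name__ == "__main__":
    print(classify_beta_turn(-65.0, -25.0, -95.0, 5.0))
    print(classify_beta_turn(60.0, -120.0, -80.0, 0.0))
```

Note that the primed types are obtained from the unprimed ones by changing the sign of every angle, which is what is meant by a mirror image of the backbone conformation.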

Omega loops have been described and classified (Leszczynski and Rose 
1986; Fetrow 1995) as irregular segments of chain where sequentially 
distant N and C-terminal residues are spatially close (Fig. 5.2c). 
Short loops have to introduce a sharp change in the direction of the 
polypeptide chain and this implies restrictions on the dihedral angles of 
their residues which can in turn be correlated to the presence of special 
amino acids such as glycines and prolines. For example, an analysis of the 
sequence patterns of β hairpins in proteins (Wilmot and Thornton 1988) 
has led to the widely accepted view that type I turns and type II turns are 
often associated with the presence of a glycine as the third or fourth 
amino acid of the loop, respectively. Similarly, sequence patterns have 
been associated with the various turn types and with omega and gamma 



turns. 



The usage of known sequence patterns in comparative modelling, 
however, would require that the precise length and type of the loop are 
known a priori and this is rarely the case. Moreover, reconciling the 
results of different authors on their sequence / structure analysis is not 
trivial. One of the effects of the intrinsic irregularity of turns and loops is 
that their definition and nomenclature varies quite substantially. In 
different instances, the classification is based on their hydrogen bond 
pattern or on their main chain dihedral angles or on the distance between 
their end points (Chou and Fasman 1977; Rose, et al. 1983; Sibanda and 
Thornton 1985; Milner-White, et al. 1988; Wilmot and Thornton 1988; 
Milner-White 1990; Wilmot and Thornton 1990). 

Another problem of all the methods based on sequence patterns is that 
they rely on the assumption that the structure of the loop is determined 
by local rather than tertiary interactions. This is not necessarily true and 
there are cases where it is clear that tertiary interactions can completely 
overrule the sequence pattern of the loop (Tramontano, et al. 1990). 
Medium-sized loops are even more difficult to predict, since the many 
possible combinations of their main chain dihedral angles make their 
classification less rigorous. 

In general they can be classified according to the type of interactions that 
stabilise them (Tramontano, et al. 1989). For loops that form compact 
substructures, the major conformation determinant is the formation of 
hydrogen bonds to main chain atoms of the loop. For loops having more 



extended conformations, the required stabilisation is obtained by packing 
an inward pointing hydrophobic side chain of the loop between the 
secondary structure elements connected by the loop. However, it appears 
that the loop dictates the interactions required to stabilise it, but in 
different proteins a variety of different topologies can be used to provide 
these interactions. This implies that this type of classification is not useful 
in deriving predictive rules. 

Data base searching is the default technique used in many model 
building experiments, especially for medium sized loops, and is based on 
the observation that segments of similar conformation occur in both 
related and unrelated proteins (Jones and Thirup 1986). 
Methods to predict loops by data base searching assume that some 
information about the structure of the loop is contained in the regions 
surrounding it, so that the latter can be used to identify the former. So far, 
only the information contained in the regions adjacent to the loop in the 
primary structure of the protein have been taken into account in this 
approach. This technique, based on the early work of Jones and Thirup 
(1986), is generally included as a tool in a number 
of commercial and academic modelling packages (Dayringer, et al. 1986; 
Jones and Thirup 1986; Jones, et al. 1990; Vriend 1990). 
The basic method consists in searching the data base of solved protein 
structures for two regions closely matching the segments preceding and 
following the loop to be modelled ("stems") which are separated by the 



same number of residues as those forming the loop to be modelled (Fig. 
5.3). The assumption is that the structure of the loop is correlated to the 
structure of its "stem" regions. In order for this technique to be useful, 
three things have to be demonstrated. The first is that the loop to model 
exists in the data base, the second that similar loops have similar stems 
and thirdly that an equivalent geometric relationship exists between the 
stems and the loops in the modelled and template structures. 
The first assumption has been shown to be true (Jones and Thirup 1986; 
Tramontano and Lesk 1992), while the other two hypotheses have been 
shown not to be generally correct in a simulated model building 
experiment. 
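
The basic stem-matching search described above can be sketched as follows. This is a simplified illustration and not the implementation of any of the cited packages: each chain is reduced to a list of Cα coordinates, and stems are compared through their internal distances, so that no superposition is needed. One known limitation of this choice is that distance comparison cannot tell a fragment from its mirror image. All names and the 1.0 Å cut-off are hypothetical.

```python
import math
from itertools import combinations


def distance_rms(coords_a, coords_b):
    """R.m.s. difference between the internal distances of two point sets."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two point sets of equal size (at least 2)")
    pairs = list(combinations(range(len(coords_a)), 2))
    total = sum(
        (math.dist(coords_a[i], coords_a[j])
         - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))


def find_loop_candidates(chain, query_stems, n_before, n_after, loop_length,
                         max_rms=1.0):
    """Scan one chain for segments whose stems match the query stems.

    `query_stems` holds the C-alpha coordinates of the n_before residues
    preceding the loop followed by the n_after residues following it.
    Returns a sorted list of (score, index_of_first_loop_residue).
    """
    window = n_before + loop_length + n_after
    hits = []
    for start in range(len(chain) - window + 1):
        before = chain[start:start + n_before]
        after = chain[start + n_before + loop_length:start + window]
        score = distance_rms(before + after, query_stems)
        if score <= max_rms:
            hits.append((score, start + n_before))
    return sorted(hits)


if __name__ == "__main__":
    # Artificial coordinates, used only to exercise the functions.
    chain = [(i * 3.8, (i % 3) * 1.0, (i * i % 5) * 0.7) for i in range(12)]
    stems = chain[2:4] + chain[7:9]
    print(find_loop_candidates(chain, stems, 2, 2, 3))
```

In a real search the same scan would be repeated over every chain of the structure data base, and the candidate loops would then be fitted onto the stems of the model.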

The basic steps of the experiment (Tramontano and Lesk 1992) consisted 
of: 

1) searching the data base for loops similar in conformation to specific 
ones (immunoglobulin hypervariable loops were used for the test), in 
order to show that the loop could indeed be predicted by using a data 
base search. The result of this search showed that, with rare exceptions, it 
was possible to find structurally similar regions in the data base of known 
structures for all the selected loops, both among immunoglobulin 
structures and unrelated proteins, 

2) searching the data base for regions matching the stems of each loop 
separated by the appropriate number of residues, 

3) comparing the loops selected in step 2 with those selected in step 1. 



Such a simulation has a better chance of being successful than a real 
modelling experiment because, in the test, the stem structures are 
experimentally determined and not predicted, as would be the case in a 
model building experiment. Nevertheless, while in some cases a good fit 
of the stems corresponded to a good fit of the intervening regions, in most 
cases a good fit of the stem did not imply a good fit of the loop or vice 
versa. There were also cases where, although both the fit of the stems and 
the loop were good, the geometric relationship between the two was 
different. 

These results indicate that, although in most cases loops of the desired 
conformation exist in non-homologous proteins, the information 
contained in the structure of the adjacent regions is not sufficient to 
identify them. Consistently with this observation, loop selection by data 
base searching seems to be able to provide a reasonable result only when 
homologous protein fragments are used, while the use of nonredundant 
fragment libraries remains problematic (Bates, et al. 1997). 
Other methods for loop structure prediction include molecular 
simulations, combinatorial searches and subsequent evaluation of loop 
candidates using conformational energy estimation and a combination of 
this latter method with data base searching techniques (Rose, et al. 1985; 
Moult and James 1986; Bruccoleri, et al. 1988; Martin, et al. 1989). 
Our understanding of the interatomic interactions is neither complete nor 
exact so that there would be serious computational difficulties in 



obtaining the correct evaluation of the energy of the protein's fragments, 
even if the rest of the model were exact. In a model building experiment, 
however, the part of the structure into which one is trying to build the 
loop is affected by an error, no matter what method has been used to 
predict it. This might seriously affect the result of a detailed free energy 
calculation. 

In other words, on one side the available forcefields (see below) are not 
sufficiently exact or complete to correctly evaluate the energy of different 
conformations. On the other, the end points of the loop regions are 
modelled and are consequently affected by an error which could strongly 
influence the results of a detailed energy calculation. Furthermore, it has 
been shown that pentapeptides of identical sequences can have 
completely different structures in different proteins (Kabsch & Sander, 
1984) so that any effective method to predict the folding of a segment of a 
protein needs to take into account its interaction with the rest of the 
structure in the calculations. 

Research in this field is extremely active, especially since it has been 
shown that incorrect loop predictions are the major cause of errors in 
models, but so far the conclusion is that available methods to predict loop 
conformations are able to identify the correct conformation only in a 
limited number of cases. 

From a practical point of view, however, it is advisable to critically 
evaluate whether it is really necessary to model the loops: while some of 



them are critical for protein functions, others are far from the regions of 
main interest of the model (e.g., binding or catalytic sites) and their 
prediction can therefore be omitted. 

II.e. Side chain modelling 

As the target and template proteins are assumed to be related, the 
conformation of the side-chains of the conserved residues should be 
modelled on that of the corresponding residues of the template. A number of 
techniques are usually employed to build the conformations of the 
mutated ones. 

Some are based on 'rotamer libraries' that is a collection of the limited 
number of conformations that each amino acid side-chain preferentially 
assumes (Ponder and Richards 1987; Dunbrack and Karplus 1993; 
Schrauber, et al. 1993; Bower, et al. 1997; Ogata and Umeyama 1997) (Fig. 
5.4). 

These libraries can differ in the grouping of amino acids used to calculate 
the statistics of the rotamer distribution, for example by taking into 
account the local environment of a residue or its backbone angles. In some 
cases, these libraries, combined with some method to exclude rotamers 
producing unfavourable steric interactions (Keller, et al. 1995; Bates, et al. 
1997), are used to build all the side-chains of the model. 
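
The combination of a rotamer library with a steric filter can be sketched as follows. The data layout is an assumption made for the example: each rotamer is a pair of a library probability and the coordinates of its side-chain atoms already placed on the backbone, and a clash is defined simply as any atom closer than a fixed cut-off (3.0 Å here, an arbitrary choice) to an atom of the environment.

```python
import math


def pick_rotamer(rotamers, environment, min_distance=3.0):
    """Return the most probable rotamer without steric clashes, or None.

    `rotamers` is a list of (probability, [atom coordinates]) pairs and
    `environment` a list of coordinates of the surrounding atoms.
    """
    ranked = sorted(rotamers, key=lambda rotamer: rotamer[0], reverse=True)
    for probability, atoms in ranked:
        clash = any(
            math.dist(atom, other) < min_distance
            for atom in atoms
            for other in environment
        )
        if not clash:
            return probability, atoms
    return None


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from a real library.
    rotamers = [
        (0.6, [(1.0, 0.0, 0.0)]),
        (0.3, [(0.0, 4.0, 0.0)]),
    ]
    environment = [(2.0, 0.0, 0.0)]
    print(pick_rotamer(rotamers, environment))
```

Because the side chains of a model influence one another, a real procedure has to iterate or search over combinations of rotamers; the sketch treats a single residue in a fixed environment.
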
Another commonly used method to build the side chains of non 
conserved residues is to import the conformational angles of the template 



up to where the relative length of the two side-chains permits, using 
rotamer libraries for the remaining part of the chain (Tramontano 1995). 
Energy-based procedures are also used. Usually these procedures start 
their refinement from a model having the most common rotamers at every 
position (Rooman, et al. 1991; Thornton 1991; Eisenmenger, et al. 1993; 
Laughton 1994; Melo and Feytmans 1997). 

While methods for modelling side-chain conformation seem to perform 
rather well when given experimental co-ordinates for the backbone atoms, 
their accuracy is much lower for protein models and decreases rapidly as 
the r.m.s.d. between the model and the real structure increases (Martin, et 
al. 1997). This might indicate that an improvement in this area could be 
automatically achieved as a consequence of improvements in backbone 
modelling. 

III. Model refinement 

The steps described above usually lead to a (fairly) complete model. 
The next stage of the process consists in inspecting it, both visually and 
through the use of specific programs, to evaluate and optimise it. 
In principle, incorrect stereochemistry, unfavourable steric interactions 
and poorly packed regions have to be fixed. This can be done through 
manual interventions followed by a few cycles of energy minimisation or 
geometric refinement. It is important to emphasise that neither method 
can substantially modify the starting model and consequently cannot 
correct large errors. 



Many attempts have been made to obtain a global refinement of a protein 
model using energy minimisation techniques or molecular dynamics. In 
blind tests, energy minimisation and molecular dynamics did not 
improve the models and often models built without any further 
refinement were closer to the real structures than those 'optimised' using 
various combinations of these methods (Martin, et al. 1997). 
Extensive energetic refinement, with whatever method, besides failing to 
improve the model, also has a major disadvantage. Regions of the model 
that are less reasonable from a structural point of view are likely to be 
those containing errors, due to either the initial alignment or the model 
building procedure used. Their optimisation, often obtained by moving 
atoms in the neighbouring regions, will hide this effect and will make 
every part of the model equally 'good'. For the model to be used 
correctly, it is important to be able to highlight 'problematic' areas of the 
structures that should be used with much more caution. 

IV. Expected accuracy of an homology model 

How reliable is a model? As we said before, the overall quality of models 
is highly dependent on the degree of similarity of the target with the 
parent structure and on the quality of the sequence alignment (Chothia 
and Lesk 1986; Hilbert, et al. 1993; Mosimann, et al. 1995; Martin, et al. 
1997). Both factors are related to the degree of sequence similarity and to 
the number of insertions and deletions between target and parent. 



In principle the relationship between sequence similarity and structural 
divergence could be used to evaluate a priori the accuracy of a model. 
However this is based on the assumption that the alignment is correct and 
that no improvement has been introduced by optimisation methods. 
It is therefore important to test the quality of models in blind tests and the 
field of protein structure prediction has received a new impulse from the 
CASP experiments. 

The impact of blind assessment experiments has been high in the field of 
comparative modelling. The technique has been used for so many years 
and in so many instances that it was taken for granted that serious 
progress in all its aspects was ongoing and that very good models could 
be built, even by using just automatic servers or programs. 
This is true for models of the core regions of proteins when the template 
is quite similar in sequence, but the results of the blind tests were far from 
exciting for medium to low homology models. 

In the first CASP experiment, no correct prediction of loops was obtained, 
the accuracy of side chain positioning was surprisingly low, the 
alignments contained many errors and some of the models were even 
stereochemically wrong (Mosimann, et al. 1995). 

Some progress was achieved in CASP2, where predictors were challenged 
to predict 9 targets with a sequence identity to the best template ranging 
between 20% and 85% (Martin, et al. 1997). 



One lesson from CASP1 was that automatic alignments should be refined 
and optimised manually and almost all groups did so in CASP2. One 
interesting example of how well this can work is illustrated below by the 
case of T28 (endoglucanase I). 
An automatic alignment in this case provides: 

TARGET : NG S PS GNLVS I TRKYQQNGVD I PSAQPGGDTISSCPS AS AY - - - GGL 

PARENT : SGAINRYYVQNGVTFQQPNAELGSYSGNELNDDYCTAEEAEFGGSSFSDKGGL 

while the correct alignment turned out to be: 

TARGET : NGS PSGNLVS ITRKYQQNGVDI PS - AQ PG - GDTI SSCP S AS AYGGL 

PARENT : S G - AI NR YYVQNG VT FQ - Q PNAELGS YS GNE LNDD YCT AEEAEFGGS SF-SDKGGL 

and the manually edited alignment by Samudrala et al. (Samudrala and 
Moult 1997): 

TARGET : NGS PSGNLVS ITRKYQQNGVDI PSA - -QPGGDTISSCP- - - S AS AYGGL 

PARENT : SGAINRYYVQNGVTFQQ PNAELGS YSGNELNDDYCTAEEAEFGGSSFSDKGGL 

The improvement is substantial in this case since the number of correctly 
aligned residues is 29 in the manually edited alignment and only 1 in the 
automatic one! 
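
Counting correctly aligned residues amounts to comparing the residue pairings implied by two alignments of the same pair of sequences. The sketch below does this and also reports a mean residue shift, defined here as the average absolute difference in the parent residue paired with each target residue; this definition is an assumption of the example and may differ from the one used in the CASP assessment.

```python
def residue_map(target_aln, parent_aln, gap="-"):
    """Map each aligned target residue index to its parent residue index."""
    if len(target_aln) != len(parent_aln):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    t = p = 0
    for a, b in zip(target_aln, parent_aln):
        if a != gap and b != gap:
            mapping[t] = p
        if a != gap:
            t += 1
        if b != gap:
            p += 1
    return mapping


def compare_alignments(predicted, reference):
    """Return (number of correctly aligned residues, mean shift).

    Each argument is a (target_aln, parent_aln) pair of gapped strings.
    Only target residues aligned in both alignments enter the comparison.
    """
    pred = residue_map(*predicted)
    ref = residue_map(*reference)
    shared = [t for t in ref if t in pred]
    correct = sum(1 for t in shared if pred[t] == ref[t])
    if not shared:
        return correct, 0.0
    mean_shift = sum(abs(pred[t] - ref[t]) for t in shared) / len(shared)
    return correct, mean_shift


if __name__ == "__main__":
    reference = ("ACDEF", "ACDEF")
    predicted = ("ACDEF-", "-ACDEF")
    print(compare_alignments(predicted, reference))
```

In the toy case every residue is displaced by one position, so no residue is correctly aligned although the two alignments look almost the same on paper.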

It also became clear that the possibility of achieving a correct alignment is 
not only dependent on the sequence identity between target and 
templates, but also on the protein itself. In other words, while a high 
sequence identity almost guarantees a reasonable alignment, there are 
cases where similar sequence identity values (20% and 26%) produce 
alignments with a mean shift of 10 and 0.2 residues, respectively (Martin, 
et al. 1997). This specific example is for proteins pectate lyase and 



stellacyanin and seems to reflect both the type of protein (pectate lyase 
has a repeated structure and this might confuse alignment programs) and 
the number and distributions of insertion and deletions. 
In CASP2 some loops were finally predicted moderately well, and there 
was improvement in modelling side chains. 

The open issues highlighted by the assessment of comparative modelling 
techniques are however not very different from those of the first CASP: 
alignment accuracy is still the major factor in determining the quality of a 
model, and its correctness below 25% sequence identity is still very low; 
the deviation between modelled and experimental loops is large, and it is 
the main factor in the high values of total r.m.s.d.; and correct inheritance from 
the parent (in terms of distinguishing which parts can be copied into the 
model and which cannot) is the most important factor. The best models 
are those which deviate least from the parent structure. In other words, 
any attempt to model de novo regions of the protein or more sophisticated 
approaches which inherit their structures less directly from the parents 
seem to perform less well. 

It is important to note, though, that the most conserved regions in proteins 
are those that have an important structural and/or functional role and 
these regions are often those modelled with higher accuracy; therefore, in 
spite of their possible shortcomings, models built by homology generally 
contain a wealth of practically useful information and are often 
instrumental in interpreting experimental data, in planning new 



experiments and in guiding the design of modified proteins (Sollazzo, et 
al. 1990; Savino, et al. 1993; Orlandini, et al. 1994; Pizzi, et al. 1994; 
Scarborough and Dunn 1994; Amati, et al. 1995; Ammendola, et al. 1995; 
Helmer-Citterich, et al. 1995; De Francesco, et al. 1996; Failla, et al. 1996; 
Luo, et al. 1996; Jackson, et al. 1997; Komissarov, et al. 1997; Neddermann, 
et al. 1997; Starling, et al. 1997). 

V. Practical remarks 

Each step of the procedure to build homology models introduces errors 
because of the underlying theoretical issues, but there are also technical 
points that may seem more trivial, but are nevertheless equally 
responsible for errors and frustrations. 

The problems related to understanding and analysing the result of the 
data base search that needs to be performed to select the template have 
been discussed before. 

However, once the template has been selected, and a sequence alignment 
has been obtained, the alignment should be optimised using a number of 
criteria to take into account the available information about the template 
protein structure, the structural superposition of homologous structures 
and the multiple sequence alignment of the family of the target protein. 
The structural alignment of two protein structures is in itself a complex 
problem, and involves the selection of a number of parameters. 
In principle one would like to obtain the best structural superposition of 
two structures within a given r.m.s.d. threshold, but the problem does not 



necessarily have a unique answer. Methods for structural superposition 
are becoming more clever (Taylor and Orengo 1989a; Taylor and Orengo 
1989b; Orengo and Taylor 1990; Rose and Eisenmenger 1991; Holm and 
Sander 1992; Orengo, et al. 1992; Holm and Sander 1993; Orengo, et al. 
1993; Orengo and Taylor 1993; Holm and Sander 1994; May and Johnson 
1994; Taylor, et al. 1994; Holm and Sander 1995b; Holm and Sander 1995a; 
Maiorov and Crippen 1995; May and Johnson 1995; Falicov and Cohen 
1996; Holm and Sander 1996a; Holm and Sander 1996b; Michie, et al. 1996; 
Orengo and Taylor 1996; Holm and Sander 1997; Orengo, et al. 1997; 
Singh and Brutlag 1997), but the user still has to decide what 'structurally 
conserved' means. 
A possible procedure is to: 

1. obtain a structural superposition of the two proteins, 

2. identify those regions that have the same secondary structure, 

3. perform a least-squares fit of these zones, 

4. extend the selected zones at each end, adding one residue at a time 
while the distance between equivalent C-α atoms is less than a threshold, for 
example 3.0 Å. 

This procedure is not easy to apply to more than two structures, and 
might produce somewhat different results according to which structural 
superposition method is used in step 1, which definition of secondary 
structure is used in step 2 and which threshold is selected in step 4, but 
can be considered as a useful starting point to analyse structural 
conservation. 
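
The zone extension of step 4 can be sketched as follows. This is only an illustration under stated assumptions: the two structures are taken to be already superposed, equivalent residues are assumed to share the same list index, and the function name extend_zone is hypothetical. No refitting is performed between extensions, which a real implementation might choose to do.

```python
import math

def extend_zone(ca_a, ca_b, start, end, threshold=3.0):
    """Grow the zone [start, end] (inclusive indices) one residue at a time.

    ca_a and ca_b are lists of (x, y, z) C-alpha coordinates of two
    structures that are ALREADY superposed; index i in one list is assumed
    to be equivalent to index i in the other (an illustrative assumption).
    """
    n = min(len(ca_a), len(ca_b))
    # Extend towards the N-terminal side while the pair stays within threshold.
    while start > 0 and math.dist(ca_a[start - 1], ca_b[start - 1]) < threshold:
        start -= 1
    # Extend towards the C-terminal side in the same way.
    while end < n - 1 and math.dist(ca_a[end + 1], ca_b[end + 1]) < threshold:
        end += 1
    return start, end

# Made-up coordinates: only the middle three residues superpose well.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
b = [(0.0, 5.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.5, 0.0), (11.4, 2.9, 0.0), (15.2, 3.5, 0.0)]
zone = extend_zone(a, b, 2, 2)
```

Changing the threshold argument is a quick way to see how sensitive the definition of 'structurally conserved' is to the value chosen in step 4.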

As far as the multiple sequence alignment of the target protein family is 
concerned, there are a number of sequence alignment editing tools which 
are quite useful. One practical suggestion is to use colour codes for each 
amino acid (a feature now provided by a number of programs) because 
this is usually very useful in detecting regions whose alignment can be 
manually improved. 

A number of parameters also have to be selected when modelling 
insertions and deletions by data base searching. The alignment will show 
a number of residues to be inserted/deleted, but it is advisable to look in 
the data base for a region including at least one residue on each side of 
the insertion/deletion. In other words, in a case such as 

template ADFRLADGTRCFRGT 
target LEFTLVD-SKCWKAS 
|stem| |stem| 

the most advisable thing would be to search for regions similar to the 
stems and separated by 2 residues. Obviously the extent of the stems also 
has to be selected. Usual default values in modelling packages are about 
5 residues; however, this parameter has to be selected on a case-by-case 
basis, also taking into account the secondary structure of the stems. For 
example, if the preceding stem is in an α helix, at least 4-5 residues 
should be used to guarantee that the matching regions would also be an α 
helix, while for a β strand a shorter region might be sufficient. In any 
case, it is always advisable to perform the search varying the dimensions 
of the stems to verify how much this affects the type of structures selected 
in the data base. 

Usually a number of regions are proposed as appropriate by modelling 
packages and it is not straightforward to decide which one to choose. 
Once again a number of rules of thumb can be given: 

1. The selected region should not bump into the backbone of the 
conserved regions of the protein or into Cβ atoms of surrounding 
residues. 

2. If the proposed loops have different conformations, the region which 
has matching pattern of glycines and prolines should be selected. 

3. Among similar structures, the one with a lower r.m.s.d. of the stems 
should be selected. 
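
The three rules of thumb above could be combined into a simple ranking, sketched below. This is a simplification, not the procedure of any particular modelling package: the dictionary keys 'seq', 'stem_rmsd' and 'clashes' are hypothetical, and the clash test itself is assumed to have been done elsewhere.

```python
def gly_pro_pattern(seq):
    """Reduce a loop sequence to its glycine/proline pattern, e.g. 'SGK' -> '-G-'."""
    return "".join(c if c in "GP" else "-" for c in seq)

def rank_loops(target_loop_seq, candidates):
    """Order candidate loops: drop clashing ones (rule 1), prefer a matching
    Gly/Pro pattern (rule 2), then prefer a lower stem r.m.s.d. (rule 3)."""
    target = gly_pro_pattern(target_loop_seq)
    usable = [c for c in candidates if not c["clashes"]]
    return sorted(
        usable,
        key=lambda c: (gly_pro_pattern(c["seq"]) != target, c["stem_rmsd"]),
    )

# Made-up candidates for a target loop 'SGK'.
candidates = [
    {"seq": "AGS", "stem_rmsd": 0.9, "clashes": False},
    {"seq": "TPS", "stem_rmsd": 0.3, "clashes": False},
    {"seq": "DGT", "stem_rmsd": 0.5, "clashes": False},
    {"seq": "NGA", "stem_rmsd": 0.1, "clashes": True},
]
ranked = [c["seq"] for c in rank_loops("SGK", candidates)]
```

Such a ranking only narrows the choice; the final selection still benefits from visual inspection of the candidates.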

After this step is completed, a modeller will encounter what may seem a 
minor problem (but is not considered so by whoever builds a model 
manually or by the authors of automatic modelling programs) and that is 
the handling of the Protein Data Bank entries and of their residue 
numbering. 

Records in a pdb file (Bernstein, et al. 1977) look like: 

ATOM     49  N   ILE 
ATOM     50  CA  ILE 
ATOM     51  C   ILE 
ATOM     52  O   ILE 
ATOM     53  CB  ILE 
The reader should notice that residue numbers of entries in the PDB data 
base are not numbers but strings, containing characters and numbers, not 
necessarily consecutive and not necessarily starting from 1. 
Residue numbers can be followed by letters (184A). Very often this 
reflects the existence of a common residue numbering scheme for the 
protein family. The example reported here is in fact that of trypsin and 
serine proteases usually follow what is called the 'chymotrypsin 
numbering scheme'. Since the PDB entry 3tpi (Huber, et al. 1974) contains 
both trypsin and its proteic inhibitor, the enzyme is identified by a chain 
(Z in this case). 
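
A minimal sketch of how such residue identifiers can be read is given below. It relies on the fixed-column layout of ATOM records (chain identifier in column 22, residue number in columns 23-26, insertion code in column 27). The example record is built for illustration only, with made-up coordinates, and the function name residue_id is hypothetical.

```python
def residue_id(line):
    """Return (chain, number, insertion_code) from one ATOM/HETATM record."""
    chain = line[21]             # column 22
    number = int(line[22:26])    # columns 23-26
    icode = line[26].strip()     # column 27, empty string if absent
    return chain, number, icode

# Illustrative record only: coordinates, occupancy and B-factor are made up.
FMT = "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
example = FMT.format(50, " CA ", "ILE", "Z", 184, "A", 1.0, 2.0, 3.0, 1.0, 20.0)
parsed = residue_id(example)
```

Keeping the chain, number and insertion code together as one key avoids treating 184 and 184A as the same residue. The order in which residues appear in the file, rather than any sorting of these keys, should be taken as the sequence order.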



Also note in this example that residues 16 and 17 have occupancy 0, that 
is they cannot be seen in the structure. A footnote in the file in fact 
explains that: 

AN OCCUPANCY OF 0.0 INDICATES THAT NO SIGNIFICANT DENSITY WAS 
FOUND IN THE FINAL FOURIER MAP. 

In the same file there are notes such as: 

THERE IS NO SIGNIFICANT ELECTRON DENSITY IN THE FINAL 
FOURIER MAP FOR THE N-TERMINUS OF THE ZYMOGEN FROM VAL Z 10 
THROUGH GLY Z 18 AND THIS DATA ENTRY CONTAINS NO 
COORDINATES FOR VAL Z 10 THROUGH LYS Z 15. 

which indicates the importance of reading the remark records of the entry. 
In general, modelling programs will preserve the original residue 
numbering of any loop selected in a data base search, will not verify 
whether the selected region has all atoms with full occupancy and will not 
record the original occupancy values. After a few insertions and deletions, 
the issue of finding the regions corresponding to a given part of the 
alignment in the structure will become quite annoying and there will be 
no way to recover the original occupancy values or any of the authors' 
remarks for the regions used to construct the loops. This implies that the 
original entry of each of the PDB files used in a modelling experiment 
should be manually checked, in particular with respect to occupancy and 
remarks of the authors. 
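
Part of this manual check can be assisted by a short script. The sketch below lists residues that contain at least one atom with occupancy 0.0, reading the occupancy from columns 55-60 of each ATOM record. The sample records are invented for illustration and the function name is hypothetical; the remark records still have to be read by eye.

```python
def zero_occupancy_residues(pdb_lines):
    """Return (chain, residue label) pairs with at least one atom of occupancy 0.0.

    Residue labels are kept as strings (e.g. '16' or '184A') because PDB
    residue numbers are not guaranteed to be plain integers.
    """
    flagged = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip REMARK and other record types
        key = (line[21], line[22:27].strip())
        if float(line[54:60]) == 0.0 and key not in flagged:
            flagged.append(key)
    return flagged

# Invented records: residues 16 and 17 have occupancy 0.0, residue 18 has 1.0.
FMT = "ATOM  {:5d} {:<4s} {:3s} {:1s}{:>5s}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
sample = [
    "REMARK   5 illustrative remark line",
    FMT.format(1, " N  ", "ILE", "Z", "16 ", 0.0, 0.0, 0.0, 0.0, 0.0),
    FMT.format(2, " CA ", "ILE", "Z", "16 ", 0.0, 0.0, 0.0, 0.0, 0.0),
    FMT.format(3, " N  ", "VAL", "Z", "17 ", 0.0, 0.0, 0.0, 0.0, 0.0),
    FMT.format(4, " N  ", "GLY", "Z", "18 ", 1.0, 2.0, 3.0, 1.0, 15.0),
]
flagged = zero_occupancy_residues(sample)
```

Running such a check on every entry used as a template or as a loop donor, before the model is built, preserves information that the modelling program will not record.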

Another issue that can arise both during the selection of the template and 
of the putative loops is that of having to use NMR structures. NMR 
produces an ensemble of structures, a number of which are usually 
included in the PDB entry together with an 'average structure'. Care 
should be taken in understanding which structure is being used and what 
is the extent of variation of the structure in the different models of the 
molecule. 



Figure legends 

Fig. 5.1 Number of protein structures deposited each year into the 
Brookhaven Protein Data Bank (PDB) (Bernstein, et al. 1977). 

Fig. 5.2 Types of turns in protein structures. a) Beta turns. Type I and I' 
are shown on the top, Type II and II' in the middle and Type III and III' on 
the bottom. b) Gamma turn and inverse gamma turn. c) An example of an 
omega turn. 

Fig. 5.3 Example of loop prediction by data base searching. Left: backbone 
of a putative template structure with a 6 residue long loop. Right: C-alpha 
trace of the template loop and of two loops of length 3 selected 
because their stems (5 residues before and 5 residues after the loop) are 
very similar to the stems of the template structure. 

Fig. 5.4 The four most common rotamers of tyrosine. They are shown after 
optimal superposition of the backbone. 
IMPLICATIONS FOR PROTEIN DESIGN 

As discussed above, the folding code is degenerate, so that many proteins, even when they 
do not share any significant sequence homology, adopt similar folds. This implies that 
designing a sequence able to adopt a given fold should in principle be easier than 
predicting which fold a given sequence will assume. 

The de novo design of proteins has in fact met with success albeit with some limitations 
[144, 145, 4, 6, 146, 7, 8, 147]. To refine the process and improve our understanding of the 
rules that govern protein folding a major effort has also been devoted to the detailed 
structural characterisation of such designed molecules [5, 148, 149, 9, 150, 12]. However, 
because our understanding of the protein folding code is still fairly rudimentary the option 
of recruiting known protein folds as frameworks for the insertion of functional sites, or the 
modification of existing enzymatic activity, has proved to be more viable and has attracted 
the most attention [151, 10, 11, 14] (Fig. 2). 

The three major challenges that protein design has to face are: i) is the designed sequence 
compatible with the desired fold? ii) is the selected fold the most favourable for that 
sequence? iii) does a folding pathway for the fold exist? 

The potential functions developed for fold recognition and secondary structure prediction 
methods should help to answer the first question [152]. Fold recognition 
techniques and folding simulations might become reliable enough to settle the other issues, so 
that the new tools developed for protein structure prediction are also giving new impetus 
to the field of protein design. 

Being able to design proteins for specific functions would have a tremendous impact in 
many areas of biology and medicine, but would also represent a key step in our 
understanding of the rules relating sequence to structure in proteins. Native proteins are a 
biased sample of the set of possible solutions of the folding problem, since they have been 
obtained through an enormous number of steps of evolution and selection and since they 
have constraints imposed by their function and interactions. Designed proteins could allow 
us to better highlight those properties of protein structure which still escape our 
understanding. 

TRENDS AND FUTURE PERSPECTIVES 

It has been suggested and shown in many instances [153, 154] that the combined use of 
prediction results coming from different methods and of experimental data is one way to 
improve the quality of the final model of a protein. 

A major issue is therefore to be able to compare these data, which are of different type and 
dimensionality and to verify which model or part of a model is consistent with the larger 
subset of them. 

In our laboratory, we have developed GLASS, a novel tool to address this issue. The system 
we have implemented is a general platform to read, visualise, project into three dimensions 
and compare the results of different structure prediction methods. It also allows the user to assess 
the consistency of the model(s) with experimental data and to compare selected parameters 
calculated for a model with the distribution observed in real protein structures (Fig. 3). 
A development version of GLASS was used during the IRBM structure prediction practical 
workshop [153, 154] on a set of target proteins where it was found to be extremely useful 
both to compare the results from different prediction methods and to map known 
experimental data onto the putative models by all the participants. 

We feel that this system is equally needed by the users of the many different prediction 
methods available and by theoreticians, who can use it as a workbench to rapidly test new 
ideas for evaluating the likelihood of different models. It is likely that predicting a protein 
structure with techniques other than homology modelling will become faster and more 
reliable as more tools for evaluating alternative models and for automatically checking the 
consistency of different results are added to systems such as GLASS. 
The big challenge is whether the automation of both the prediction and evaluation 
procedures will allow the creation of a 'prediction database' containing predictions at different 
levels of detail for any sequence. Given the large number of sequences of unknown structure 
being generated by the genomic sequencing projects, automation of the prediction and 
evaluation steps, and the possibility of making the results automatically available via the 
Internet, would provide a valuable resource for theoreticians and experimentalists, and 
might be a key step in predicting the function of a protein sequence and understanding its 
mechanism of action, which is, after all, the final goal of protein structural studies. 
Figure legends 



Fig. 1: Modeling by homology of the variable domain of antibodies: superposition of the 
model structure of antibody DB3 to the crystal structure [155]. The model and the structure 
are cyan and green in the framework region, yellow and orange in the H3 region, and red and 
magenta in the other hypervariable loops, respectively. Only the C, Cα and N atoms of the 
main chain are shown. 

Fig. 2: Designed structure of the Minibody: the design was based on the structure of the 
variable domain of an antibody of known structure. The 61-residue minibody protein 
includes three β-strands from each of the two β-sheets of the variable (V) heavy chain 
domain of the mouse antibody McPC603 [156], along with the segments corresponding to 
the exposed hypervariable H1 and H2 loops of the immunoglobulin as defined by Chothia 
and Lesk [61]. A metal binding site, shown in the Figure, was designed in the molecule in 
order to probe its proper folding. 

Fig. 3: a) Window of the fold recognition analysis tool in GLASS. The output of a fold 
recognition program is reformatted and analysed. Each fold in the data base, shown with 
its respective score, can be selected to create a starting model. b) The model can be displayed 
using RasMol (Sayle, R.: Rasmol. 1994; see <URL: 
http://www.glaxowelcome.cs.uk/netscape/software/>). The ribbon represents 
the variability of the sequence calculated from a multiple sequence alignment: the more 
variable residues have a larger red ribbon, the more conserved have a small blue ribbon. 
Dotted lines represent predicted correlated mutations. GLASS allows experimental 
data to be mapped onto a predicted structure by user-defined coloured balls and/or labels. c) View of the 
multiple sequence alignment used to generate the ribbon in (b), coloured according to the 
hydrophobicity of the residues. 
Abstract 

Logistic regression (LR), discriminant analysis (DA), and neural networks (NN) were used to 
predict ordered and disordered regions in proteins. Training data were from a set of non-redundant 
X-ray crystal structures, with the data being partitioned into N-terminal, C-terminal and internal 
(I) regions. The DA and LR methods gave almost identical 5-cross validation accuracies that 
averaged to the following values: 75.9 ± 3.1% (N-regions), 70.7 ± 1.5% (I-regions), and 74.6 ± 
4.4% (C-regions). NN predictions gave slightly higher scores: 78.8 ± 1.2% (N-regions), 72.5 ± 
1.2% (I-regions), and 75.3 ± 3.3% (C-regions). Predictions improved with length of the disordered 
regions. Averaged over the three methods, values ranged from 52% to 78% for length = 9-14 to ≥ 
21, respectively, for I-regions, from 72% to 81% for length = 5 to 12-15, respectively, for N-regions, 
and from 70% to 80% for length = 5 to 12-15, respectively, for C-regions. These data support the 
hypothesis that disorder is encoded by the amino acid sequence. 

1 Introduction 

The current paradigm is that protein function depends on 3D structure [10, 16, 18], yet some proteins 
are partially or completely unfolded in their native states [2, 3, 7, 24, 26]. For such "natively 
unfolded" [30], "natively disordered" [9] or "intrinsically unstructured" [31] proteins, the lack of a fixed 
structure can be an integral part of the function. Are such disordered proteins common or rare? 

To estimate the commonness of disordered proteins, we applied predictors of disorder to appropriate 
databases [20]. The results suggested that intrinsic disorder is common [21], but lack of structural 
information limits confidence in these findings. Since the needed structural information will be slow 
in coming, we are revisiting the question of commonness by improving our disorder predictions. 

A limitation of our previous studies was that only neural networks (NNs) were tried. By comparing 
NNs with discriminant analysis (DA) and logistic regression (LR), we can gain additional confidence 
in the suitability of prediction for identifying ordered and disordered protein. 

Technical limitations of our previous algorithms resulted in absence of predictions on 15 residues 
at each end [20], resulting in non-prediction of a significant fraction of the residues. Here we modified 
the algorithms to extend the predictions to the N- and C-termini. 

2 Materials and Methods 
2.1 Data 

Using missing electron density in X-ray structures as indicating disorder [19], we identified 115 
N-terminal, 84 C-terminal and 69 internal (I) disordered regions (DRs) that were contained in 197 
unrelated proteins listed in PDB-select-25 [11]. The minimum lengths used were 5 and 9 for termini 
and I-regions, respectively. The various DRs contained the following numbers of residues: 1,644 
(N-regions), 1,347 (I-regions) and 1,250 (C-regions). A set of 130 unrelated, disorder-free proteins that 
were also from PDB-select-25 [11] was used to generate the ordered residues used for predictor training. 

2.2 Attribute Generation 

Composition-based and property-based attributes were calculated over sliding windows [20, 32]. A 
total of 51 attributes were examined, where the sets of amino acids represented some property such as 
aromaticity, charge, sheet formers, etc. (Table 1). 

Table 1: Attributes list. 
X13    GSA 
X26    SGKPDE 
X39    N 
X51    Coordination number 

Composition-based attributes were the sums of the numbers of the indicated amino acids in a given 
window. For example, aromaticity, X1 = FWY, is the number of phenylalanines (F) + tryptophans (W) 
+ tyrosines (Y) within a given window. The number of histidines was sometimes divided by 2 (e.g. 
H/2) due to its small ring size or partial charge. For the net charge attributes, X3 and X4, the number 
of each negative residue was subtracted (e.g. -D, -E) from the number of positive residues. 

Property-based attributes were the sums of the residue property values. For X49 = flexibility, the 
value for each residue was based on its backbone-atom B-factors averaged over 92 unrelated protein 
structures [28]. The values for X50 = hydropathy were from the Kyte-Doolittle scale [15]. X51 = 
coordination number is the average number of side chain neighbors that are in contact with the given 
side chain when it is fully buried, as determined from a set of 33 non-homologous proteins [8]. 

As in previous studies [20], a window of 21 was used for I-regions. A window of 11 was used for 
positions 6 onwards and for -6 backwards for N- and C-regions, respectively. Predictions at positions 
1 to 5 and -1 to -5 used windows of size 6 to 10, respectively. For N-regions, these windows included 
residues from the N end to 5 positions beyond the predicted residue and, for C-regions, from the C 
end to 5 positions before it. 
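
The window sizes quoted above are consistent with a window of five residues on either side of the predicted position, truncated at the terminus. The sketch below illustrates this for N-regions; the truncation rule is an inference from the stated sizes, and the function name is hypothetical.

```python
def n_terminal_window(position, length, flank=5):
    """Return (first, last), 1-based and inclusive, of the window used to
    predict `position` counted from the N terminus of a chain of `length`
    residues. The window is cut off where it would run past either end."""
    first = max(1, position - flank)
    last = min(length, position + flank)
    return first, last

def window_size(position, length, flank=5):
    first, last = n_terminal_window(position, length, flank)
    return last - first + 1

# Positions 1..6 of a 100-residue chain give window sizes 6, 7, 8, 9, 10, 11.
sizes = [window_size(p, 100) for p in range(1, 7)]
```

For C-regions the same function can be applied to positions counted backwards from the C terminus, and a flank of 10 gives the window of 21 used for I-regions.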

2.3 Logistic Regression Model and Attribute Selection 

The logistic regression (LR) model was developed for dealing with the situation in which the dependent 
variable is binary [5]. Here we used order = 0, and disorder = 1. SAS (Release 6.12, SAS Institute, 
Cary, NC) was used for the calculations. 

For a given threshold probability, an observation is classified into the category with the probability 
higher than the threshold. In the logistic model, the probability is estimated from the following 
equation: 

$$\ln\left[\frac{p}{1-p}\right] = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_m X_{im}$$

where p = P(Y_i = 1) is the probability that residue i is disordered and 1 - p = P(Y_i = 0) is the 
probability that it is ordered; i = 1, 2, ..., n, where n is the sample size; j = 1, 2, ..., m, where m 
is the number of attributes; and X_{i1}, ..., X_{im} are the attributes used for prediction. 

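The parameters b_j are estimated by maximizing the following log-likelihood function: 

$$\ln L(B) = \sum_{i=1}^{n} \left[ Y_i \ln p_i + (1 - Y_i) \ln (1 - p_i) \right]$$

where B is the vector of parameters that need to be estimated and p_i is the value of p for 
observation i. After all b_j values are estimated, p can be calculated as: 

$$p = \frac{1}{1 + e^{-\left(b_0 + \sum_{j=1}^{m} b_j X_{ij}\right)}}$$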

For order = 0 and disorder = 1, the threshold is set to be 0.5; if p > 0.5, then the amino acid is 
predicted to be disordered; otherwise, ordered. The LR is applied each time an attribute is introduced 
or removed, and the Chi-square test is executed [1]. The process is repeated until introduction or 
removal of an attribute leads to no change at a significance level of 0.05. Eight selected attributes 
were used in the LR predictor even though a few more passed the significance test. 
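
The classification step can be illustrated with a short sketch, using the coding order = 0, disorder = 1 and the 0.5 threshold given above. The coefficient and attribute values are invented for illustration; they are not the fitted parameters of this study, and the function names are hypothetical.

```python
import math

def disorder_probability(b0, coeffs, attributes):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_j b_j * X_j)))."""
    z = b0 + sum(b * x for b, x in zip(coeffs, attributes))
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    """Return 1 (disordered) if p exceeds the threshold, otherwise 0 (ordered)."""
    return 1 if p > threshold else 0

# Invented coefficients and attribute values for a single residue window.
p_example = disorder_probability(0.0, [1.0], [2.0])
label = classify(p_example)
```

Because the logit is linear in the attributes, the sign and size of each coefficient indicate how strongly an attribute pushes a residue towards the disordered class.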

2.4 Discriminant Analysis Model 

For discriminant analysis (DA), it is assumed that prior probabilities are equal, that the variables 
(attributes) are independent, and that all attribute values satisfy the normal distribution. Since we 
used sliding windows to obtain data and since many of the attributes share dependencies on the 
same amino acids, the assumption that the data are independent is not true. However, this lack of 
independence didn't seem to cause serious problems since this approach gave results comparable to 
the other methods in this study. Again, SAS (Release 6.12, SAS Institute, Cary, NC) was used to 
carry out the calculations for this model. 

For the ordered and disordered data {(x_i, y_i)}, i = 1, ..., n; y_i ∈ {0, 1}, where y = 0 for an 
ordered amino acid and y = 1 for a disordered one. The x_i values are the attribute data. We used the 
Bayesian discriminant analysis method to predict the probability that a given amino acid belongs to 
an ordered or a disordered region. The posterior (conditional) probability that a residue belongs to an 
ordered or disordered region is given by the following equation: 

$$P(C_j \mid x) = \frac{P(x \mid C_j)\,P(C_j)}{P(x \mid D)\,P(D) + P(x \mid O)\,P(O)}$$

where j = 0 (ordered, C_0 = O) or 1 (disordered, C_1 = D); P(O) and P(D) are the a priori probabilities of a 
residue being ordered and disordered, respectively. P(x | D) and P(x | O) are the conditional densities of 
disordered and ordered residues, respectively. For the disordered class, P(C_j | x) is given by the 
following relationship: 

$$P(D \mid x) = \frac{e^{b_{d0} + b_d' x}}{\sum_{k \in \{o, d\}} e^{b_{k0} + b_k' x}} = \frac{1}{1 + e^{(b_{o0} - b_{d0}) + (b_o - b_d)' x}}$$

Using observed data, the parameters b_{d0} and b_{o0} and the vectors b_d and b_o can be estimated. The 
classification for a given pattern x is determined as: Class = argmax{P(C_j | x)}, where class is 0 or 
1 for ordered or disordered, respectively. 
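
The posterior calculation can be sketched as follows under the stated assumptions of equal priors, independent attributes and normally distributed values. This is a simplified illustration with per-class means and variances; it is not the SAS procedure used in the study, and all names and numbers are hypothetical.

```python
import math

def gaussian_density(x, means, variances):
    """Product of independent one-dimensional normal densities."""
    density = 1.0
    for xi, m, v in zip(x, means, variances):
        density *= math.exp(-((xi - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return density

def posterior_disorder(x, ordered, disordered, prior_d=0.5):
    """Bayes rule: P(D | x) = P(x | D) P(D) / (P(x | D) P(D) + P(x | O) P(O)).

    `ordered` and `disordered` are (means, variances) tuples for each class.
    """
    pd = gaussian_density(x, *disordered) * prior_d
    po = gaussian_density(x, *ordered) * (1.0 - prior_d)
    return pd / (pd + po)

# One invented attribute: ordered windows centred on 0, disordered on 2.
ordered = ([0.0], [1.0])
disordered = ([2.0], [1.0])
midpoint = posterior_disorder([1.0], ordered, disordered)
near_disordered = posterior_disorder([2.0], ordered, disordered)
```

With equal variances the posterior reduces to a logistic function of a linear combination of the attributes, which may help explain why DA and LR behave so similarly.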



The attributes were repeatedly introduced or removed, and the F-test was applied after each 
operation, until no attributes could be introduced or removed at a significance level of 0.05 [6]. The 
top eight selected attributes were used for establishing the DA predictor even though a greater number 
were accepted at the significance level indicated. 

2.5 Neural Network Model 

The application of NNs to order/disorder prediction has been described elsewhere in more detail [20]. 
The feed forward NN used in this study is fully connected with an 8x8x1 architecture, which has eight 
inputs (selected by LR), one hidden layer with 8 nodes and one output layer with one node. The back 
propagation method was used for data training [23]. 
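
A forward pass through a fully connected 8x8x1 network of this kind can be sketched as below. The sigmoid activation is an assumption (the activation function is not stated here), the weights are placeholders rather than trained values, and training by back propagation is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """8 inputs -> 8 hidden nodes -> 1 output, all layers fully connected."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Placeholder weights: zero hidden weights make every hidden node output 0.5.
inputs = [1.0] * 8
w_hidden = [[0.0] * 8 for _ in range(8)]
b_hidden = [0.0] * 8
output_zero = forward(inputs, w_hidden, b_hidden, [0.0] * 8, 0.0)
output_ones = forward(inputs, w_hidden, b_hidden, [1.0] * 8, 0.0)
```

An output above 0.5 would be read as a disorder prediction under the order = 0, disorder = 1 coding used for the other two methods.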

3 Results 



3.1 Attribute selection 

A list of 51 attributes was used in this study (Table 1). Many of the attribute values are correlated. In 
addition, some attributes make little contribution in distinguishing the ordered and disordered regions. 
Finally, 51 attributes is simply too many for the amount of disordered data. These characteristics 
necessitated the selection of a subset of the attributes for the predictors. 

Stepwise DA and stepwise LR were used for attribute selection on ordered and disordered data 
from the N-, C- and I-regions. Although more than eight attributes were selected for the data at a 
significance level of 0.05, the ninth and later selected attributes make relatively little contribution, as 
shown by the prediction accuracy upon addition of attributes in their order of importance (Fig. 1). 
Figure 1: Contribution of selected attributes on prediction. 

The selected attributes in Table 2 start with the most important. For the top 8 sequence attributes 
in a given protein region, the DA and LR models selected almost the same ones. That is, 5/8, 6/8, 
and 8/8 attributes were selected in common by the two methods for the N-, I-, and C-region data, 
respectively. In contrast, the selected attributes were very different for the different regions. Only 1 
sequence attribute was selected in common for all three regions. For the 3 pairs of regions, only 4/8 
were selected in common for N- and C-regions, just 3/8 for the N- and I-regions, and a mere 2/8 for 
the C- and I-regions. These results suggest that sequence characteristics leading to disorder depend 
on the location of the region within the protein. 



Table 2: Attributes selected according to the significance in DA and LR. 

Attributes               1     2     3     4     5     6     7     8 
DA: N-terminal region    X25   X38   X51   X34   X20   X35   X31   X39 
LR: N-terminal region    X25   X38   X51   X34   X30   X45   X48   X39 
DA: Internal region      X49   X42   X11   X34   X43   X31   X40   X35 
LR: Internal region      X49   X42   X7    X14   X43   X31   X40   X35 
DA: C-terminal region    X51   X34   X42   X25   X38   X50   X44   X48 
LR: C-terminal region    X51   X34   X42   X25   X38   X50   X44   X48 

3.2 Prediction Accuracies 

The prediction accuracies of the 3 models over the 3 regions are given in Table 3. The DA and LR 
models gave almost identical accuracies for each region, with the largest difference being 0.3% (for 
I-regions). Also, using the N-regions as an example, the 0.1% difference between the two methods is 
much less than the ± 3.5% and ± 2.7% variation among the 5-cross validation trials. Thus, the DA 
and LR models give essentially indistinguishable prediction accuracies overall. 

Table 3: Five-cross validations of the predictors developed by three methods. 

Model                   Region     1       2        3       4       5       Average 
Neural Network          N region   79.0%   78.8%    78.7%   78.9%   78.7%   78.8% (±1.2%) 
                        I region   72.2%   72.6%    73.1%   72.2%   72.4%   72.5% (±1.2%) 
                        C region   75.1%   75.5%    74.9%   74.4%   76.5%   75.3% (±3.3%) 
Discriminant Analysis   N region   74.2%   78.4%    75.9%   73.7%   77.2%   75.9% (±3.5%) 
                        I region   70.1%   71.3%    70.0%   71.8%   71.1%   70.9% (±1.4%) 
                        C region   72.7%   71.6%    77.0%   76.3%   75.9%   74.7% (±4.1%) 
Logistic Regression     N region   74.0%   77.3%    76.3%   74.2%   77.2%   75.8% (±2.7%) 
                        I region   69.6%   70.62%   69.8%   71.7%   71.4%   70.6% (±1.6%) 
                        C region   72.0%   71.3%    77.3%   76.6%   75.5%   74.5% (±4.7%) 

The NN approach gives slightly higher predictions for all three regions. In the following, the first 
number in each pair is the NN accuracy and the second number is the average of the DA and LR 
accuracies: 78.8 ± 1.2% versus 75.9 ± 3.1% (N-regions), 72.5 ± 1.2% versus 70.7 ± 1.5% (I-regions), 
and 75.3 ± 3.3% versus 74.6 ± 4.4% (C-regions). 

3.3 Cross Prediction 

Each predictor was applied to the data from the regions not used for its training, here called cross 
prediction. In Table 4 accuracies observed during 5-cross validation (indicated by *) are compared 
with the accuracies for cross predictions (no *). For the most part, as expected, the accuracy of a 
given predictor drops when applied to the data from a region different from its training set. However, 
for both the LR and DA models trained on I-regions, the accuracies remain essentially the same when 
the predictors are applied to C-region data. That is, the LR model only changes from 70.6% on its 
I-regions training data to 70.9% when applied to C-region data, and the DA model, from 70.9% to 
71.2%. This failure to drop in accuracy is especially surprising since I- and C-regions predictors share 
just 2/8 attributes. 
Table 4: Cross-prediction specificity for disordered regions. 

Predictors           Region            N-terminal Data   Internal DR Data   C-terminal Data 
Discriminant Model   N-region          75.9%*            52.9%              61.5% 
                     Internal region   64.9%             70.9%*             71.2% 
                     C-region          71.3%             68.8%              74.7%* 
Logistic Model       N-region          75.8%*            44.6%              57.6% 
                     Internal region   66.3%             70.6%*             70.9% 
                     C-region          71.6%             68.9%              74.5%* 


3.4 Length dependence of prediction accuracy. 

To estimate accuracy versus length, the prediction outputs were partitioned according to length, with 
the number of residues in each class indicated in parentheses (Table 5). For the DA and LR predictions 
in Table 5, the models from 5-cross validation were retrained on 5/5 of the data, whereas for the NN 
predictions, retraining on the whole set of data was not performed. Instead, one of the NN models, 
which was trained on 4/5 of the data, was used. For DRs of 9 to 14, the roughly 52% accuracy 
(averaged over the 3 methods) corresponds to essentially random classification. For DRs of 15 to 20, 
the average accuracy increased to 74%, and for DRs of 21 or longer the average increased still further to about 
78%. Since the windows are 21 in length, the shorter DRs fill only a fraction of their windows, and 
therefore the poor accuracies are expected. 



Table 5: Prediction accuracies for different I-DR lengths. 

Predictors   9-14 AA (379)   15-20 AA (262)   21 AA or longer (707) 
NN           52.8%           73.7%            78.6% 
DA           50.9%           74.4%            77.9% 
LR           52.2%           74.4%            78.2% 


The lowered prediction rates due to the short disordered windows probably help to explain the 
surprising cross prediction results that occur when the predictors trained on I-regions are applied to 
C-region data as described above. 

The N- and C-region data also show length-dependent accuracies (Table 6). For N-region data, 
the accuracies, averaged over the three methods, change from 72% (length = 5), to 83% (length = 
6-8), to 77% (length = 9-11) to 81% (length = 12-15). For C-region data, the respective averaged 
accuracies are 69%, 78%, 72% and 80%. 



Table 6: Prediction accuracies for different N- or C-DR lengths. 

DR        Predictors for     DR=5 AA        DR=6-8 AA        DR=9-11 AA       DR=12-15 AA 
Regions   N and C regions    (N:60; C:65)   (N:269; C:117)   (N:219; C:135)   (N:137; C:163) 
N         NN                 75.0%          83.6%            77.1%            86.0% 
          DA                 71.7%          83.3%            78.1%            81.0% 
          LR                 70.0%          82.2%            76.3%            77.4% 
C         NN                 70.5%          73.1%            74.2%            85.2% 
          DA                 67.7%          74.4%            63.0%            75.5% 
          LR                 67.7%          74.4%            63.0%            76.1% 


3.5 Position-by-position accuracy for N- and C-regions 

The position-by-position error rates were determined; all three predictors give similar outputs that 
result in fairly complex curves (Fig. 2). The data in Fig. 2 are incommensurate with the data in Table 
6, so these should not be compared directly. This is discussed below in more detail. 
Figure 2: Prediction accuracy over AA positions in N- and C- regions. 



4 Discussion 

4.1 Data 

Disorder characterized by X-ray diffraction can be static or dynamic [13]. In our previous studies we 
attempted to remove this ambiguity by finding independent information such as protease sensitivity 
or NMR spectra [20], but most often the information was lacking. As an alternative, we compared 
X-ray-characterized disorder with disorder characterized by other methods, especially NMR [9]. The 
results indicate that the ambiguity of X-ray-characterized disorder leads to the 
introduction of noise into the training data. 

4.2 Comparison of Prediction models 

There is no single best algorithm for pattern recognition problems. Performance for a given algorithm 
depends on the data set being investigated [14]. DA, LR and NN approaches are among the most 
commonly used, and all have been applied to sequence analysis problems. DA has been successfully 
used for predicting internal exons of DNA sequences [25] and protein secondary structure [27, 33]. LR 
has been used for identifying regulatory regions of genes [29]. NNs have also been used for predicting 
secondary structure [22]. Considering the characteristics of the three methods, we decided to try all 
of them in this study. 

The LR and DA models exhibited nearly identical performance for the disorder predictions, whereas
the NN gave a slightly higher accuracy (Table 3). Application of Cochran's test [4] indicates that the
superiority of the neural network is statistically significant. However, prediction accuracy is a simplistic
indicator, so it seems inappropriate to rank the methods on this basis alone.
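
As an illustration of how such a comparison can be set up, the following is a minimal sketch of
Cochran's Q statistic for three classifiers scored on the same test examples. The correctness table
is made up and the function names are hypothetical; this is not the analysis code used in the study.
The p-value helper relies on the closed form of the chi-square tail for exactly 2 degrees of freedom,
which applies only because three classifiers are compared.

```python
import math


def cochran_q(outcomes):
    """Cochran's Q for k matched binary outcomes.

    outcomes: one row per test example; each row holds k values
    (1 = classifier correct, 0 = incorrect), one per classifier.
    Returns (Q, degrees_of_freedom).
    """
    k = len(outcomes[0])
    if any(len(row) != k for row in outcomes):
        raise ValueError("every row needs one outcome per classifier")
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)
    # The denominator equals sum(R_i * (k - R_i)) over the rows.
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when all classifiers agree on every example")
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numerator / denominator, k - 1


def chi2_sf_two_df(x):
    """Upper-tail probability of a chi-square variable with exactly 2 d.f."""
    return math.exp(-x / 2.0)


# Made-up correctness table; columns are DA, LR, NN (illustrative only).
toy_outcomes = [
    (1, 1, 1),
    (0, 0, 1),
    (0, 0, 1),
    (1, 1, 1),
    (0, 0, 0),
    (0, 1, 1),
]

q_stat, dof = cochran_q(toy_outcomes)
p_value = chi2_sf_two_df(q_stat)  # valid only because dof == 2 here
print(f"Q = {q_stat:.3f}, d.f. = {dof}, p = {p_value:.3f}")
```

Examples on which all three classifiers agree contribute nothing to the denominator, so the test is
driven entirely by the examples on which the methods disagree.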

Olson [17] reported that, with proper selection of attributes, both statistical and neural network
classifiers yield essentially identical accuracies for a given test case. The possible superiority of
the NN predictors therefore has two possible implications. First, other factors not included
in Table 1 might affect the determination of order or disorder. To test this, other attributes need to
be investigated. Second, the predictors might not be optimized.

DA is fast and performs well except for very skewed data [14]. LR was developed for binary data 
and so might be the most robust for predicting two states, order and disorder. DA and LR methods 
need much less computation time than NN, and produce results that are easier to interpret. 

Back-propagation NNs perform well in most cases, especially for noisy data. Noisy data are of
particular concern because of the ambiguity of X-ray-characterized disorder. With appropriate architecture,
a back-propagation neural network can be a universal approximator for arbitrary finite inputs [12];
no assumptions are required for the input and output parameters.

There are some general disadvantages, however, in using NNs. For example, the selection of the 
architecture (number of layers, number of neurons) is empirical. If too few hidden neurons are used, 
training convergence is often poor, whereas if too many are used, the network might converge well, 
but generalization is typically poor. A further shortcoming of NNs is the failure to provide insight. 
That is, there is no deterministic way to carry out attribute selection. For these reasons, we carried 
out an entirely separate study to gain understanding of our problem [32]. A significant advantage of 
the LR and DA methods is the ability to carry out step-wise addition of the various attributes. 
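
The step-wise addition mentioned above can be sketched as a greedy forward selection. The sketch
below is generic and makes several assumptions: the function name forward_select, the stopping
rule, and the toy scoring function are all hypothetical. In practice the score would be something
like the cross-validated accuracy of an LR or DA model fitted on the candidate subset.

```python
def forward_select(candidates, score, max_attributes):
    """Greedy forward selection.

    candidates: iterable of attribute names.
    score: callable taking a list of attribute names, returning a number
           (higher is better).
    Adds one attribute per step and stops when no candidate improves the score.
    """
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining and len(selected) < max_attributes:
        # Score every one-attribute extension of the current subset.
        trial_scores = {name: score(selected + [name]) for name in remaining}
        best_name = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_name] <= best_score:
            break  # no remaining attribute helps
        selected.append(best_name)
        remaining.remove(best_name)
        best_score = trial_scores[best_name]
    return selected, best_score


# Toy stand-in for model accuracy: a baseline plus made-up per-attribute gains.
TOY_GAINS = {"X49": 0.12, "X42": 0.06, "X43": 0.04, "noise": -0.01}


def toy_score(subset):
    return 0.5 + sum(TOY_GAINS[name] for name in subset)


chosen, final_score = forward_select(TOY_GAINS, toy_score, max_attributes=8)
print(chosen, round(final_score, 2))
```

The order in which attributes enter gives a rough ranking of their usefulness, which is the kind of
insight that is harder to extract from a trained NN.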

4.3 Attribute Selection 

Both our previous studies and the studies on I-region data presented here used windows of 21 residues. 
Despite the very different databases in the two studies, the previously selected attributes closely 
resemble those reported here. That is, 6 of 8 attributes were selected in common by the LR and DA 
methods; these were X49 (flexibility), X42 (R), X43 (S), X31 (E), X40 (P), and X35 (I) as shown in 
Table 2. Of the 6 attributes in common, 5 were selected in our previous studies on completely different 
databases of order and disorder; only the last, and least important attribute found here, X35, was not 
selected previously. Of the 4 attributes not selected in common, namely X11 (VILM) and X34 (H) by
DA and X7 (WFYC) and X14 (WCFIYVLHM) by LR, all are identical to, or share amino acids with,
attributes selected previously on completely different data.

The prediction of order or disorder for I-regions depends on a balance of different types of attributes. 
X49 (flexibility), X42 (R), X43 (S), X31 (E), and X40 (P) are attributes that, at high value, favor 
disorder, whereas X35 (I), X11 (VILM), X7 (WFYC), and X14 (WCFIYVLHM) all favor order.
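
The idea of a balance between attributes can be made concrete with a logistic model, in which
disorder-favoring attributes carry positive weights and order-favoring attributes carry negative
weights. The sketch below is an illustration only: the attribute names come from the text, but every
weight and the intercept are invented, attribute values are assumed to be standardized, and the
attributes selected by DA and by LR are mixed together for simplicity.

```python
import math

# Invented weights. Only the signs follow the text:
# positive = favors disorder, negative = favors order.
TOY_WEIGHTS = {
    "X49": 1.2,   # flexibility
    "X42": 0.8,   # R
    "X43": 0.6,   # S
    "X31": 0.5,   # E
    "X40": 0.4,   # P
    "X35": -0.7,  # I
    "X11": -0.9,  # VILM
    "X7": -1.0,   # WFYC
    "X14": -1.1,  # WCFIYVLHM
}
TOY_INTERCEPT = 0.0


def disorder_probability(attributes, weights=TOY_WEIGHTS, intercept=TOY_INTERCEPT):
    """Logistic model: P(disorder) = 1 / (1 + exp(-(b0 + sum of b_j * x_j))).

    attributes: dict mapping attribute name to its (standardized) window value.
    """
    z = intercept + sum(weights[name] * value for name, value in attributes.items())
    return 1.0 / (1.0 + math.exp(-z))


flexible_window = {"X49": 1.0, "X43": 0.5}      # leans toward disorder
hydrophobic_window = {"X7": 1.0, "X14": 0.5}    # leans toward order
print(disorder_probability(flexible_window))
print(disorder_probability(hydrophobic_window))
```

A window in which the two kinds of attributes cancel gives a probability near 0.5, which is the sense
in which the prediction depends on their balance.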

This is the first study of the relationship between amino acid sequence and disorder at the ends of 
proteins. Comparing attributes for N- and C-regions with each other and with attributes for I-regions 
provides insight regarding disorder at the ends of proteins. 

Although just 4 of 8 attributes are in common between the two ends, these include the top two
attributes for each (Table 2). That is, the top two attributes for N-regions data, X25 (VIYFW) and
X38 (M), rank fourth and fifth, respectively, for C-regions data. Also, the first, X51 (Coordination
Number), and second, X34 (H), for C-regions rank third and fourth, respectively, for N-regions. From
Fig. 1, these top attributes are the most important. Of the attributes specific for each end, some
contain residues with charges opposite to the charge at the termini (Table 2). For example, the
positive charge at the N-terminus is opposite to the negative charges (E and D) in X20 (WYFEDH)
and to that of X31 (E). Likewise, the negative charge of the C-terminus is opposite to the positive
charge of X42 (R).

The attributes selected for the N- and C-regions can for the most part be described as being
associated with order, whereas the attributes selected for I-regions appear
to be more balanced between attributes favoring order and those favoring disorder. Even the charged
attributes, X31 (E) and X42 (R), which are associated with disorder in I-regions, are selected at the
ends in a manner that could be promoting order in these regions.

Perhaps I-regions are neutral with respect to order or disorder, whereas perhaps N- and C-regions tend
to be naturally disordered. If so, order or disorder in I-regions is determined by the overall balance of
various types of attributes, whereas overcoming the natural disorder tendency at the ends may require
the presence of order-inducing amino acids in these regions.

4.4 Prediction accuracies 

If only the longer I-regions data are considered, the estimated accuracy here (Table 5) is slightly better
than we found earlier. That is, here we find about 78% (average of DA and LR) versus about 73-74%
(NN) reported previously [20]. The slight improvement probably relates to the increased number
of attributes surveyed, 51 here versus 24 previously. More specifically, only single amino acids were
used in the original study, whereas the expanded set used here contains combinations of amino acids.
Several of the selected combinations include groups of the single amino acids selected in the original
study, thus creating space for additional inputs that bring more information to bear on the problem.

The length-dependence of I-region predictions shows a very large gradient, from almost random 
predictions (near 52% averaged over the three methods) for length = 9-14 to fairly strong predictions 
(about 78% averaged over the three methods) for length > 21. Because windows of 21 were used, 
the shorter lengths only partially filled the windows and so the essentially random predictions are a 
reasonable outcome when the disorder training examples contain large amounts of order. 
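
The window-filling argument can be checked with a short calculation. The sketch below assumes, for
illustration only, that a 21-residue window is centered on each residue of a disordered region and
that the region is flanked on both sides by at least 10 ordered residues; the windowing actually
used for the training data may differ. The function name is hypothetical.

```python
def mean_disordered_fraction(region_length, window=21):
    """Average fraction of window positions that fall inside the disordered
    region, taken over windows centered on each residue of that region.

    Assumes an odd window and ordered flanks at least window // 2 residues
    long on both sides of the region.
    """
    half = window // 2
    total_overlap = 0
    for center in range(region_length):
        first = max(0, center - half)                 # first disordered residue in the window
        last = min(region_length - 1, center + half)  # last disordered residue in the window
        total_overlap += last - first + 1
    return total_overlap / (region_length * window)


for length in (9, 14, 21, 40):
    print(length, round(mean_disordered_fraction(length), 3))
```

Under these assumptions every window over a 9-residue region contains all 9 disordered residues and
12 ordered ones, so only 9/21 (about 43%) of each "disorder" example is actually disordered. The
fraction rises as the region gets longer.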

Here we report our first attempt to predict to the ends of the protein. We included down to very 
short DRs (5 amino acids) with the expectation that we would find some minimum length below 
which the predictions would fail completely. Such failure would give random predictions like those 
observed for the shortest I-regions data, although for different reasons. To our surprise, even DRs as 
short as 5 amino acids at the ends yielded good prediction accuracies, about 72% (N-regions) and 70% 
(C-regions) when averaged over the three methods (Table 6). Although the increase was not monotonic,
accuracy reached 82% (N-regions) and 80% (C-regions) for DRs of length 12-15. These high values
suggest the possibility of special effects at the ends of proteins.

The NN, LR, and DA methods give similar curves for the position-dependent accuracies at each end
(Fig. 2), with high values followed by minima that are very noticeable for the N-regions curves and barely
noticeable for the C-regions curves. The causes of these minima near positions 9-10 are uncertain.
One possibility is that windows at the 9-10 positions for the disorder data contain substantial fractions 
of ordered residues, resulting from a combination of the distribution of disorder lengths in the training 
data and the way in which the windows were specified. Based on this idea, we are exploring alternative 
window specifications with the goal of reducing these minima. 

The data in Fig. 2 were grouped differently from the data in Table 6. This leads to apparent
discrepancies, such as the > 80% accuracies for positions 1-5 (N-regions, Fig. 2), which appear to be better
than the 72% accuracy for N-region DRs of length 5 AA (averaged over the 3 methods from Table 6). The
apparent discrepancy arises because the data for Table 6 come from DRs of the specified lengths, whereas
the data for Fig. 2 are predictions at particular positions from DRs of all different lengths. So, the higher
accuracy of > 80% for the first 5 positions results from contributions from DRs longer than 5, which
yield predictions over the first 5 positions better than the 72% observed for DRs of length 5.
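
The two groupings can be illustrated with a toy calculation. In the sketch below each DR is a list
of per-residue flags (1 = predicted correctly), numbered from the terminus. The three DRs are made
up and the function names are hypothetical; the point is only that pooling by position mixes DRs of
all lengths, whereas pooling by length does not.

```python
from collections import defaultdict


def accuracy_by_length(regions):
    """Pool all residues of DRs that share the same length (Table 6 style)."""
    correct, total = defaultdict(int), defaultdict(int)
    for flags in regions:
        correct[len(flags)] += sum(flags)
        total[len(flags)] += len(flags)
    return {length: correct[length] / total[length] for length in total}


def accuracy_by_position(regions):
    """Pool residues at the same position over DRs of every length (Fig. 2 style)."""
    correct, total = defaultdict(int), defaultdict(int)
    for flags in regions:
        for position, flag in enumerate(flags, start=1):
            correct[position] += flag
            total[position] += 1
    return {position: correct[position] / total[position] for position in total}


# Made-up predictions: one DR of length 5 and two DRs of length 8.
toy_regions = [
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 1, 1],
]

by_length = accuracy_by_length(toy_regions)
by_position = accuracy_by_position(toy_regions)
print(by_length)
print(by_position)
```

In this toy case the length-5 DR is predicted with 60% accuracy, yet every one of positions 1-5 shows
a higher pooled accuracy because the longer DRs contribute to those positions as well.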

4.5 Implications for Future Research 

The high accuracy of prediction of very short DRs at the termini might be special, due to end effects, 
or the high accuracy might be simply the result of the use of very short windows. If the latter is true, 
then use of shorter windows might be of benefit for I-region predictions as well. 

A second task will be to merge our various predictors into one, making it possible to predict
disorder from the amino to the carboxyl terminus of a protein. This will open the way for a variety
of projects, such as improving predictions of disorder on a genomic basis and using disorder
predictions to indicate which proteins are likely to crystallize and which ones are not.